Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Chapter 25, 6e - 24, 5e Distributed Databases CSE 4701 Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department The University of Connecticut 191 Auditorium Road, Box U-155 Storrs, CT 06269-3155 [email protected] http://www.engr.uconn.edu/~steve (860) 486 - 4818   A portion of these slides are being used with the permission of Dr. Ling Lui, Associate Professor, College of Computing, Georgia Tech. Remaining slides represent new material. Chaps25.1 Classical and Distributed Architectures  CSE 4701   Classic/Centralized DBMS Dominated the Commercial Market from 1970s Forward Problems of this Approach  Difficult to Scale w.r.t. Performance Gains  If DB Overloaded, replace with a Faster Computer  this can Only Go So Far - Disk Bottlenecks Distributed DBMS have Evolved to Address a Number of Issues  Improved Performance  Putting Data “Near” Location where it is Needed  Replication of Data for Fault Tolerance  Vertical and Horizontal Partitioning of DB Tuples Chaps25.2 Common Features of Centralized DBMS  CSE 4701   Data Independence  High-Level Representation via Conceptual and External Schemas  Physical Representation (Internal Schema) Hidden Program Independence  Multiple Applications can Share Data  Views/External Schema Support this Capability Reduction of Program/Data Redundancy  Single, Unique, Conceptual Schema  Shared Database  Almost No Data Redundancy  Controlled Data Access Reduces Inconsistencies  Programs Execute with Consistent Results Chaps25.3 Common Features of Centralized DBMS  CSE 4701 Promote Sharing: Automatically Provided via CC  No Longer Programmatic Issue  Most DBMS Offer Locking for Key Shared Data  Oracle Allows Locks on Data Item (Attributes)  For Example, Controlling Access to Shared Identifier     Coherent and Central DB Administration Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules Data Resiliency  Physical Integrity of Data in the Presence of Faults and Errors  Supported by DB Recovery Data Security: Control Access for Authorized Users Against Sensitive Data Chaps25.4 Shared Nothing Architecture  CSE 4701    In this Architecture, Each DBMS Operates Autonomously There is No Sharing  Three Separate DBMSs on Three Different Computers Applications/Clients Must Know About the External Schemas of all Three DBMSs for  Database Retrieval  Client Processing Complicates Client  Different DBMS Platforms (Oracle, Sybase, Informix, ..)  Different Access Modes (Query, Embedded, ODBC)  Difficult for SWE to Code Chaps25.5 Difficulty in Access – Manage Multiple APIs  CSE 4701 Each Platform has a Different API  API1 , API3 , …. , APIn  An App Programmer Must Utilize All three APIs which could differ by PL – C++, C, Java, REST, etc.  Any interactions Across 3 DBs – must be programmatically handled without DB Capabilities API1 API2 APIn Chaps25.6 NW Architecture with Centralized DB  CSE 4701 High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide  Clients at Any Site can Access Repository  Data May be “Far” Away - Increased Access Time  In Practice, Each Remote Site Needs only Portion of the Data in DB1 and/or DB2  Inefficient, no Replication w.r.t. Failure Chaps25.7 Fully Distributed Architecture  CSE 4701    The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - its Distributed Replication is Possible for Fault Tolerance Queries at one Site May Need to Access Data at Another Site (e.g., for a Join) Increased Transaction Processing Complexity Chaps25.8 Distributed Database Concepts  CSE 4701   A transaction can be executed by multiple networked computers in a unified manner. A distributed database (DDB) processes a Unit of execution (a transaction) in a distributed manner. A distributed database (DDB) can be defined as  Collection of multiple logically related database distributed over a computer network  Distributed database management system as a software system that manages a distributed database while making the distribution transparent to the user. Chaps25.9 Goals of DDBMS  CSE 4701    Support User Distribution Across Multiple Sites  Remote Access by Users Regardless of Location  Distribution and Replication of Database Content Provide Location Transparency  Users Manipulate their Own Data  Non-Local Sites “Appear” Local to Any User Provide Transaction Control Akin to Centralized Case  Transaction Control Hides Distribution  CC and Serializability - Must be Extended Minimize Communications Cost  Optimize Use of Network - a Critical Issue  Distribute DB Design Supported by Partitioning (Fragmentation) and Replication Chaps25.10 Goals of DDBMS  CSE 4701   Improve Response Time for DB Access  Use a More Sophisticated Load Control for Transaction Processing  However, Synchronization Across Sites May Introduce Additional Overhead System Availability  Site Independence in the Presence of Site Failure  Subset of Database is Always Available  Replication can Keep All Data Available, Even When Multiple Sites Fail Modularity  Incremental Growth with the Addition of Sites  Dedicate Sites to Specific Tasks Chaps25.11 Advantages of DDBMS  CSE 4701 1. There are Four Major Advantages Transparency  Distribution/NW Transparency  User Doesn’t Know about NW Configuration (Location Transparency)  User can Find Object at any Site (Naming Transparency)  Replication Transparency (see next PPT)  User Doesn’t Know Location of Data  Replicas are Transparently Accessible  Fragmentation Transparency  Horizontal Fragmentation (Distribute by Row)  Vertical Fragmentation (Distribute by Column) Chaps25.12 Data Distribution and Replication CSE 4701 Chaps25.13 Other Advantages of DDBMS CSE 4701 2. Increased Reliability and Availability  Reliability - System Always Running  Availability - Data Always Present  Achieved via Replication and Distribution  Ability to Make Single Query for Entire DDBMS 3. Improved Performance  Sites Able to Utilize Data that is Local for Majority of Queries 4. Easier Expansion  Improve Performance of Site by  Upgrading Processor of Computer  Adding Additional Disks  Splitting a Site into Two or More Sites  Expansion over Time as Business Grows Chaps25.14 Challenges of DDBMS  CSE 4701    Tracking Data - Meta Data More Complex  Must Track Distribution (where is the Data)  V & H Fragmentation (How is Data Split)  Replication (Multiple Copies for Consistency) Distributed Query Processing  Optimization, Accessibility, etc., More Complex  Block Analysis of Data Size Must also Now Consider the NW Transmitting Time Distributed Transaction Processing  TP Potentially Spans Multiple Sites  Submit Query to Multiple Sites  Collect and Collate Results Distributed Concurrency Control Across Nodes Chaps25.15 Challenges of DDBMS  CSE 4701    Replicated Data Management  TP Must Choose the Replica to Access  Updates Must Modify All Replica Copies Distributed Database Recovery  Recovery of Individual Sites  Recovery Across DDBMS Security  Local and Remote Authorization  During TP, be Able to Verify Remote Privileges Distributed Directory Management  Meta-Data on Database - Local and Remote  Must maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites Chaps25.16 A Complete Schema with Keys ... CSE 4701 Keys Allow us to Establish Links Between Relations Chaps25.17 …and Corresponding DB Tables CSE 4701 which Represent Tuples/Instances of Each Relation A S C null W B null null 1 4 5 5 Chaps25.18 …with Remaining DB Tables CSE 4701 Chaps25.19 What is Fragmentation?  CSE 4701  Fragmentation Divides a DB Across Multiple Sites Two Types of Fragmentation  Horizontal Fragmentation  Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites  Each Site has a Subset of the n Tuples  Essentially Fragmentation is a Selection  Vertical Fragmentation  Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites  Essentially Fragmentation is a Projection  Not Generally Utilized in Practice  In Both Cases, Sites can Overlap for Replication Chaps25.20 Horizontal Fragmentation  CSE 4701     A horizontal subset of a relation which contain those of tuples which satisfy selection conditions. Consider Employee relation with condition DNO = 5 All tuples satisfying this create a subset which will be a horizontal fragment of Employee relation. A selection condition may be composed of several conditions connected by AND or OR. Derived horizontal fragmentation:  Partitioning of a primary relation to other secondary relations which are related with Foreign keys. Chaps25.21 Horizontal Fragmentation  Site 2 Tracks All Information Related to Dept. 5 CSE 4701 Chaps25.22 Horizontal Fragmentation  CSE 4701  Site 3 Tracks All Information Related to Dept. 4 Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments Chaps25.23 Refined Horizontal Fragmentation  CSE 4701   Further Fragment from Site 2 based on Dept. that Employee Works in Notice that G1 + G2 + G3 is the Same as WORKS_ON5 there is no Overlap Chaps25.24 Refined Horizontal Fragmentation  CSE 4701   Further Fragment from Site 3 based on Dept. that Employee Works in Notice that G4 + G5 + G6 is the Same as WORKS_ON4 Note Some Fragments can be Empty Chaps25.25 Vertical Fragmentation  CSE 4701   Subset of a relation created via a subset of columns.  A vertical fragment of a relation will contain values of selected columns.  There is no selection condition used in vertical fragmentation.  A strict vertical slice/partition Consider the Employee relation.  A vertical fragment of can be created by keeping the values of Name, Bdate, Sex, and Address. Since no condition for creating a vertical fragment  Each fragment must include the primary key attribute of the parent relation Employee.  All vertical fragments of a relation are connected. Chaps25.26 Vertical Fragmentation Example  CSE 4701  Partition the Employee Table as Below Notice Each Vertical Fragment Needs Key Column EmpDemo EmpSupvrDept Chaps25.27 Homogeneous DDBMS  CSE 4701 Homogeneous  Identical Software (w.r.t. Database)  One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites  Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc.  Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …) Chaps25.28 Non-Federated Heterogeneous DDBMS  CSE 4701 Non-Federated Heterogeneous  Different Software (w.r.t. Database)  Multiple DB Products (e.g., Oracle at One Site, MySQL at another, Sybase, Informix, etc.)  Replicated Administration (e.g., Users Needs Accounts on Multiple Systems)  Varied Programmatic Access - SWEs Must Know All Platforms/Client Software Complicated  Very Close to Shared Nothing Architecture Chaps25.29 Federated DDBMS  CSE 4701  Federated  Multiple DBMS Platforms Overlaid with a Global Schema View  Single External Schema Combines Schemas from all Sites Multiple Data Models  Relational in one Component DBS  Object-Oriented in another DBS  Hierarchical in a 3rd DBS Chaps25.30 Federated DBMS Issues  CSE 4701    Differences in Data Models  Reconcile Relational vs. Object-Oriented Models  Each Different Model has Different Capabilities  These Differences Must be Addressed in Order to Present a Federated Schema Differences in Constraints  Referential Integrity Constraints in Different DBSs  Different Constraints on “Similar” Data  Federated Schema Must Deal with these Conflicts Differences in Query Languages  SQL-89, SQL-92, SQL2, SQL3  Specific Types in Different DBMS (Oracle Blobs ) Differences in Key Processing & Timestamping Chaps25.31 Heterogeneous Distributed Database Systems  CSE 4701  Federated: Each site may run different database system but the data access is managed through a single conceptual schema.  The degree of local autonomy is minimum.  Each site must adhere to a centralized access policy  There may be a global schema. Multi-database: There is no one conceptual global schema  For data access a schema is constructed dynamically as needed by the application software. Object Unix Relational Unix Oriented Site 5 Site 1 Hierarchical Window Communications Site 4 network Object Oriented Network DBMS Site 3 Linux Site 2 Linux Relational Chaps25.32 Query Processing in Distributed Databases Issues CSE 4701  Cost of transferring data (files and results) over the network.  This cost is usually high so some optimization is necessary.  Example relations: Employee at site 1 and Department at Site 2 – Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size = 106 bytes. Fname Minit Lname SSN Bdate Address Sex Salary Superssn Dno – Department at Site 2. 100 rows. Row size = 35 bytes. Table size = 3,500 bytes. Dname Dnumber Mgrssn Mgrstartdate  Q: For each employee, retrieve employee name and department name Where the employee works.  Q: Fname,Lname,Dname (Employee Dno = Dnumber Department) Chaps25.33 Query Processing in Distributed Databases  CSE 4701 Result  The result of this query will have 10,000 tuples, assuming that every employee is related to a department.  Suppose each result tuple is 40 bytes long.  The query is submitted at site 3 and the result is sent to this site.  Problem: Employee and Department relations are not present at site 3. Chaps25.34 Query Processing in Distributed Databases  CSE 4701 Strategies: 1. Transfer Employee and Department to site 3.  Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes. 2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.  Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes. 3. Transfer Department relation to site 1, execute the join at site 1, and send the result to site 3.  Total bytes transferred = 400,000 + 3500 = 403,500 bytes.  Optimization criteria: minimizing data transfer.  Preferred approach: strategy 3. Chaps25.35 Query Processing in Distributed Databases  CSE 4701  Consider the query  Q’: For each department, retrieve the department name and the name of the department manager Relational Algebra expression:  Fname,Lname,Dname (Employee Mgrssn = SSN Department) Chaps25.36 Query Processing in Distributed Databases  CSE 4701 Result of query has 100 tuples, assuming that every department has a manager, the execution strategies are: 1. Transfer Employee and Department to the result site and perform the join at site 3.  Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes. 2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3. Query result size = 40 * 100 = 4000 bytes.  Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes. 3. Transfer Department relation to site 1, execute join at site 1 and send the result to site 3.  Total transfer size = 4000 + 3500 = 7500 bytes.  Preferred strategy: Choose strategy 3. Chaps25.37 Query Processing in Distributed Databases  CSE 4701 Now suppose the result site is 2. Possible strategies : 1. Transfer Employee relation to site 2, execute the query and present the result to the user at site 2.  Total transfer size = 1,000,000 bytes for both queries Q and Q’. 2. Transfer Department relation to site 1, execute join at site 1 and send the result back to site 2.  Total transfer size for Q = 400,000 + 3500 = 403,500 bytes and for Q’ = 4000 + 3500 = 7500 bytes. Chaps25.38 DDBS Concurrency Control and Recovery  CSE 4701 Distributed Databases encounter a number of concurrency control and recovery problems which are not present in centralized databases, including:  Dealing with multiple copies of data items  How are they All Updated if Needed?  Failure of individual sites  How are Queries Restarted or Rerouted?  Communication link failure  Network Failure  Distributed commit  How to Know All Updates Done at all Sites?  Distributed deadlock  How to Detect and Recover? Chaps25.39 The Next Big Challenge  CSE 4701  Interoperability  Heterogeneous Distributed Databases  Heterogeneous Distributed Systems  Autonomous Applications Scalability  Rapid and Continuous Growth  Amount of Data  Variety of Data Types  Dealing with personally identifiable information (PII) and personal health information (PHI)  Emergence of Fitness and Health Monitoring Apps  Google Fit and Apple HealthKit  New Apple ResearchKit for Medical Research Chaps25.40 Interoperability: A Classic View CSE 4701 Local Schema Simple Federation Multiple Nested Federation FDB Global Schema FDB Global Schema 4 Federated Integration Federated Integration Local Schema Local Schema FDB 1 Local Schema Federation FDB3 Federation Chaps25.41 Java Client with Wrapper to Legacy Application CSE 4701 Java Client Java Application Code WRAPPER Mapping Classes JAVA LAYER Interactions Between Java Client and Legacy Appl. via C and RPC C is the Medium of Info. Exchange Java Client with C++/C Wrapper NATIVE LAYER Native Functions (C++) RPC Client Stubs (C) Legacy Application Network Chaps25.42 COTS and Legacy Appls. to Java Clients CSE 4701 COTS Application Legacy Application Java Application Code Java Application Code Native Functions that Map to COTS Appl NATIVE LAYER Native Functions that Map to Legacy Appl NATIVE LAYER JAVA LAYER JAVA LAYER Mapping Classes JAVA NETWORK WRAPPER Mapping Classes JAVA NETWORK WRAPPER Network Java Client Java Client Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers Chaps25.43 Java Client to Legacy App via RDBS CSE 4701 Transformed Legacy Data Java Client Updated Data Relational Database System(RDS) Extract and Generate Data Transform and Store Data Legacy Application Chaps25.44 Database Interoperability in the Internet  CSE 4701  Technology  Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML, SOAP, REST API, WSDL Architecture Information Broker •Mediator-Based Systems •Agent-Based Systems Chaps25.45 JDBC  CSE 4701  JDBC API Provides DB Access Protocols for Open, Query, Close, etc. Different Drivers for Different DB Platforms JDBC API Java Application Driver Manager Driver Oracle Driver Access Driver Driver Sybase Chaps25.46 Connecting a DB to the Web  CSE 4701 DBMS  CGI Script Invocation or JDBC Invocation Web Server Internet  Web Server are Stateless DB Interactions Tend to be Stateful Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open Browser Chaps25.47 Connecting More Efficiently  CSE 4701 DBMS Helper Processes CGI Script or JDBC Invocation  Web Server Internet  To Avoid Cost of Opening Database, One can Use Helper Processes that Always Keep Database Open and Outlive Web Connection Newly Invoked CGI Scripts Connect to a Preexisting Helper Process System is Still Stateless Browser Chaps25.48 DB-Internet Architecture CSE 4701 WWW Client (Netscape) WWW client (Info. Explore) WWW Client (HotJava) Internet HTTP Server DBWeb Gateway DBWeb Gateway DBWeb Gateway DBWeb Dispatcher DBWeb Gateway Chaps25.49 EJB Architecture CSE 4701 Chaps25.50 Technology Push  CSE 4701   Computer/Communication Technology (Almost Free)  Plenty of Affordable CPU, Memory, Disk, Network Bandwidth  Next Generation Internet: Gigabit Now  Wireless: Ubiquitous, High Bandwidth Information Growth  Massively Parallel Generation of Information on the Internet and from New Generation of Sensors  Disk Capacity on the Order of Peta-bytes Small, Handy Devices to Access Information The focus is to make information available to users, in the right form, at the right time, in the appropriate place. Chaps25.51 Research Challenges  CSE 4701 Ubiquitous/Pervasive Many computers and information appliances everywhere, networked together  Inherent Complexity:  Coping with Latency (Sometimes Unpredictable)  Failure Detection and Recovery (Partial Failure)  Concurrency, Load Balancing, Availability, Scale  Service Partitioning  Ordering of Distributed Events “Accidental” Complexity:  Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades.  Autonomy: Change and Evolve Autonomously  Tool Deficiencies: Language Support (Sockets,rpc), Debugging, Etc. Chaps25.52 Infosphere Problem: too many sources,too much information CSE 4701 Internet: Information Jungle Infopipes Clean, Reliable, Timely Information, Anywhere Digital Earth Personalized Filtering & Info. Delivery Sensors Chaps25.53 Current State-of-Art – Has Mobile Changed This? CSE 4701 Web Server Mainframe Database Server Thin Client Chaps25.54 Infosphere Scenario – Where Does Mobile Fit? CSE 4701 Infotaps & Fat Clients Sensors Variety of Servers Many sources Database Server Chaps25.55 Heterogeneity and Autonomy  CSE 4701 Heterogeneity:  How Much can we Really Integrate?  Syntactic Integration  Different Formats and Models  XML/JSON/RDF/OWL/SQL Query Languages  Semantic Interoperability  Basic Research on Ontology, Etc.  DoD Maps (Grid, True, and Magnetic North)  Autonomy  No Central DBA on the Net  Independent Evolution of Schema and Content  Interoperation is Voluntary  Interface Technology DCOM: Microsoft Standard  CORBA, Etc... Chaps25.56 Security and Data Quality  CSE 4701 Security  System Security in the Broad Sense  Attacks: Penetrations, Denial of Service  System (and Information) Survivability  Security Fault Tolerance  Replication for Performance, Availability, and Survivability  Data Quality  Web Data Quality Problems     Local Updates with Global Effects Unchecked Redundancy (Mutual Copying) Registration of Unchecked Information Spam on the Rise Chaps25.57 Legacy Data Challenge  CSE 4701  Legacy Applications and Data  Definition: Important and Difficult to Replace  Typically, Mainframe Mission Critical Code  Most are OLTP and Database Applications Evolution of Legacy Databases  Client-server Architectures  Wrappers  Expensive and Gradual in Any Case Chaps25.58 Potential Value Added/Jumping on Bandwagon  CSE 4701     Sophisticated Query Capability  Combining SQL with Keyword Queries Consistent Updates  Atomic Transactions and Beyond But Everything has to be in a Database!  Only If we Stick with Classic DB Assumptions Relaxing DB Assumptions  Interoperable Query Processing  Extended Transaction Updates Commodities DB Software  A Little Help is Still Good If it is Cheap  Internet Facilitates Software Distribution  Databases as Middleware Chaps25.59 Concluding Remarks  CSE 4701  Goals of Distributed DBS  Support User Distribution Across Multiple Sites  Provide Location Transparency  Provide Transaction Control Akin to Centralized Case  Minimize Communications Cost Advantages of Distributed DBS  Transparency  Increased Reliability and Availability  Improved Performance  Easier Expansion Chaps25.60 Data Mining/Warehousing CSE 4701 Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department The University of Connecticut 191 Auditorium Road, Box U-155 Storrs, CT 06269-3155 [email protected] http://www.engr.uconn.edu/~steve (860) 486 - 4818   A portion of these slides are being used with the permission of Dr. Ling Lui, Associate Professor, College of Computing, Georgia Tech. Remaining slides represent new material. Chaps25.61 Data Warehousing and Data Mining  CSE 4701  Data Warehousing  Provide Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making  Underlying Infrastructure in Support of Mining  Provides Means to Interact with Multiple DBs  OLAP (on-Line Analytical Processing) vs. OLTP Data Mining  Discovery of Information in a Vast Data Sets  Search for Patterns and Common Features based  Discover Information not Previously Known  Medical Records Accessible Nationwide  Research/Discover Cures for Rare Diseases  Relies on Knowledge Discovery in DBs (KDD) Chaps25.62 What is Purpose of a Data Warehouse?  CSE 4701      Traditional databases are not optimized for data access but have to balance the requirement of data access to ensure integrity Most data warehouse users need only read access, but need the access to be fast over a large volume of data. Most of the data required for data warehouse analysis comes from multiple databases and these analysis are recurrent and predictable to be able to design software meet requirements. Critical for tools that provide decision makers with information to make decisions quickly and reliably based on historical data. Aforementioned Charactereistics achieved by Data Warehousing and Online analytical processing (OLAP) W. H Inmon characterized a data warehouse as:  “A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions.” Chaps25.63 Data Warehousing and OLAP  CSE 4701   A Data Warehouse  Database is Maintained Separately from an Operational Database  “A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support for Management’s Decision Making Process [W.H.Inmon]” OLAP (on-Line Analytical Processing)  Analysis of Complex Data in the Warehouse  Attempt to Attain “Value” through Analysis  Relies on Trained and Adept Skilled Knowledge Workers who Discover Information Data Mart  Organized Data for a Subset of an Organization Chaps25.64 Building a Data Warehouse  CSE 4701 Option 1  Leverage Existing Repositories  Collate and Collect  May Not Capture All Relevant Data  Option 2  Start from Scratch  Utilize Underlying Corporate Data Corporate data warehouse Option 1: Consolidate Data Marts Option 2: Build from scratch Data Mart ... Data Mart Data Mart Data Mart Corporate data Chaps25.65 Comparison with Traditional Databases  CSE 4701     Data Warehouses are mainly optimized for appropriate data access  Traditional databases are transactional  Optimized for both access mechanisms and integrity assurance measures. Data warehouses emphasize historical data as their support time-series and trend analysis. Compared with transactional databases, data warehouses are nonvolatile. In transactional databases, transaction is the mechanism change to the database. In warehouse, data is relatively coarse grained and refresh policy is carefully chosen, usually incremental. Chaps25.66 Classification of Data Warehouses  CSE 4701  Generally, Data Warehouses are an order of magnitude larger than the source databases. The sheer volume of data is an issue, based on which Data Warehouses could be classified as follows.  Enterprise-wide data warehouses  Huge projects requiring massive investment of time and resources.  Virtual data warehouses  Provide views of operational databases that are materialized for efficient access.  Data marts  Generally targeted to a subset of organization, such as a department, and are more tightly focused. Chaps25.67 Data Warehouse Characteristics  CSE 4701    Utilizes a “Multi-Dimensional” Data Model Warehouse Comprised of  Store of Integrated Data from Multiple Sources  Processed into Multi-Dimensional Model Warehouse Supports of  Times Series and Trend Analysis  “Super-Excel” Integrated with DB Technologies Data is Less Volatile than Regular DB  Doesn’t Dramatically Change Over Time  Updates at Regular Intervals  Specific Refresh Policy Regarding Some Data Chaps25.68 Three Tier Architecture CSE 4701 monitor External data sources OLAP Server integrator Summarization report Operational databases Extraxt Transform Load Refresh serve Data Warehouse Query report Data mining metadata Data marts Chaps25.69 Data Modeling for Data Warehouses  CSE 4701  Traditional Databases generally deal with twodimensional data (similar to a spread sheet).  However, querying performance in a multidimensional data storage model is much more efficient. Data warehouses can take advantage of this feature as generally these are  Non volatile  The degree of predictability of the analysis that will be performed on them is high. Chaps25.70 What is a Multi-Dimensional Data Cube?  CSE 4701    Representation of Information in Two or More Dimensions Typical Two-Dimensional - Spreadsheet In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful Aggregate Raw Data! Chaps25.71 Data Warehouse Design  CSE 4701   Most of Data Warehouses use a Start Schema to Represent Multi-Dimensional Data Model Each Dimension is Represented by a Dimension Table that Provides its Multidimensional Coordinates and Stores Measures for those Coordinates A Fact Table Connects All Dimension Tables with a Multiple Join  Each Tuple in Fact Table Represents the Content of One Dimension  Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimensional Tables  Links Between the Fact Table and the Dimensional Tables for a Shape Like a Star Chaps25.72 Sample Fact Tables CSE 4701 Chaps25.73 Example of Star Schema CSE 4701 Product Date Date Month Year Sale Fact Table Date ProductNo ProdName ProdDesc Categoryu Product Store Customer Unit_Sales Store StoreID City State Country Region Dollar_Sales Customer CustID CustName CustCity CustCountry Chaps25.74 A Second Example of Star Schema … CSE 4701 Chaps25.75 and Corresponding Snowflake Schema CSE 4701 Chaps25.76 Multi-dimensional Schemas  CSE 4701 Fact Constellation  Fact constellation is a set of tables that share some dimension tables.  However, fact constellations limit the possible queries for the warehouse. Chaps25.77 Data Warehouse Issues  CSE 4701  Data Acquisition  Extraction from Heterogeneous Sources  Reformatted into Warehouse Context - Names, Meanings, Data Domains Must be Consistent  Data Cleaning for Validity and Quality is the Data as Expected w.r.t. Content? Value?  Transition of Data into Data Model of Warehouse  Loading of Data into the Warehouse Other Issues Include:  How Current is the Data? Frequency of Update?  Availability of Warehouse? Dependencies of Data?  Distribution, Replication, and Partitioning Needs?  Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)? Chaps25.78 Processing in a Data Warehouse  CSE 4701 Processing Types are Varied and Include:  Roll-up: Data is summarized with increasing generalization  Drill-Down: Increasing levels of detail are revealed  Pivot: Cross tabulation is performed  Slice and dice: Performing projection operations on the dimensions.  Sorting: Data is sorted by ordinal value.  Selection: Data is available by value or range.  Derived attributes: Attributes are computed by operations on stored derived values. Chaps25.79 On-Line Analytical Processing  CSE 4701  Data Cube  A Multidimensonal Array  Each Attribute is a Dimension In Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date Product Product Store Date Sale acron Rolla,MO 7/3/99 325.24 budwiser LA,CA 5/22/99 833.92 large pants NY,NY 2/12/99 771.24 Pants Diapers Beer Nuts West East 3’ diaper Cuba,MO 7/30/99 81.99 Region Central Mountain South Jan Feb March April Date Chaps25.80 Examples of Data Mining  CSE 4701 The Slicing Action  A Vertical or Horizontal Slice Across Entire Cube Months Slice on city Atlanta Products Sales Products Sales Months Multi-Dimensional Data Cube Chaps25.81 Examples of Data Mining  CSE 4701 The Dicing Action  A Slide First Identifies on Dimension  A Selection of Any Cube within the Slice which Essentially Constrains All Three Dimensions Months Products Sales Products Sales Months March 2000 Electronics Atlanta Dice on Electronics and Atlanta Chaps25.82 Examples of Data Mining Drill Down - Takes a Facet (e.g., Q1) and Decomposes into Finer Detail Jan Feb March Products Sales CSE 4701 Drill down on Q1 Roll Up on Location (State, USA) Roll Up: Combines Multiple Dimensions From Individual Cities to State Q1 Q2 Q3 Q4 Products Sales Products Sales Q1 Q2 Q3 Q4 Chaps25.83 Mining Other Types of Data  Analysis and Access Dramatically More Complicated! CSE 4701 Spatial databases Multimedia databases World Wide Web Time series data Geographical and Satellite Data Chaps25.84 Advantages/Objectives of Data Mining  CSE 4701   Descriptive Mining  Discover and Describe General Properties  60% People who buy Beer on Friday also have Bought Nuts or Chips in the Past Three Months Predictive Mining  Infer Interesting Properties based on Available Data  People who Buy Beer on Friday usually also Buy Nuts or Chips Result of Mining  Order from Chaos  Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc.  Impact on Marketing Strateg Chaps25.85 Why is Data Mining Popular?  CSE 4701 Technology Push  Technology for Collecting Large Quantity of Data  Bar Code, Scanners, Satellites, Cameras  Technology for Storing Large Collection of Data  Databases, Data Warehouses  Variety of Data Repositories, such as Virtual Worlds, Digital Media, World Wide Web  Corporations want to Improve Direct Marketing and Promotions - Driving Technology Advances  Targeted Marketing by Age, Region, Income, etc.  Exploiting User Preferences/Customized Shopping Chaps25.86 Requirements & Challenges in Data Mining  CSE 4701    Security and Social  What Information is Available to Mine?  Preferences via Store Cards/Web Purchases  What is Your Comfort Level with Trends? User Interfaces and Visualization  What Tools Must be Provided for End Users of Data Mining Systems?  How are Results for Multi-Dimensional Data Displayed? Performance Guarantees  Range from Real-Time for Some Queries to LongTerm for Other Queries Data Sources of Complex Data Types or Unstructured Data - Ability to Format, Clean, and Load Data Sets Chaps25.87 Successful Data Mining Applications  CSE 4701    Business Data Analysis and Decision Support  Marketing, Customer Profiling, Market Analysis and Management, Risk Analysis and Management Fraud Detection  Detecting Telephone Fraud, Automotive and Health Insurance Fraud, Credit-card Fraud, Suspicious Money Transactions (Money Laundering) Text Mining  Message Filtering (Email, Newsgroups, Etc.)  Newspaper Articles Analysis Sports  IBM Advanced Scout Analyzed NBA Game Statistics (Shots Blocked, Assists and Fouls) to Gain Competitive Advantage Chaps25.88