RAMP up your storage options
Strategies for Warehouse Repositories
Dr. Thomas Becker
SAS Institute

SAS System Background
• A major role of the SAS System has been, and continues to be:
– Accessing difficult data structures, transparently
– Manipulating data through a powerful and easy-to-use 4th generation programming language
– Organizing data for business reporting and statistical analysis

In this presentation we will be exploring SAS data stores within the broader context of the strengths of the SAS System and our core technologies, and the role SAS plays in the software marketplace.

SAS Data Storage Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server

Our data storage strategy for the SAS System is RAMP - Relational, Access, MDDB, and Parallel. The RAMP acronym for SAS data storage repositories comes from the article “Orlando II ramps up storage options,” SAS Communications, A Quarterly Magazine for European SAS Software Users, 4Q ‘96, pages 22-23. A white paper on SAS data storage repositories is also being written by Rick Evans in our European office, and will be available to you soon as an additional resource.

SAS Tables and Views
• Architecture
• Features
• Positioning

SAS Tables: Architecture
[Diagram: Descriptor Information; Data values arranged by OBSERVATION/ROW and VARIABLE/COLUMN; Indexes]

SAS Tables: Definition
A SAS table is a collection of data values and their associated descriptive information arranged and presented in a form that can be recognized and processed by the SAS system.
SAS Tables: Components
• Data Values organized in a rectangular structure of columns and rows
• Descriptor Information that identifies attributes of both the table and data values, providing the most basic level of metadata in the SAS system and making the table a self-documenting object
• Indexes to enable the SAS system to quickly locate the observations (rows of data) associated with a particular value or range of data values for a key variable (column). (Implementation note: a global B-tree of value/row identifier pairs enables the SAS engine to search by data values quickly.)

An alternative to the pass-through mode is using SAS views and SAS tables.

SAS Data Library: High Level Architecture
[Diagram labels: SAS Data Library; SAS Catalogs; SAS Tables; SAS Data Files; SAS Data Views; Other SAS Files; SAS Stored Programs; SAS Access Descriptors]

Briefly, let’s take a bird’s eye view of the SAS Data Storage architecture, and where SAS tables and views fit into the larger schema.

Relational Data Model
• Data is organized in tables
• Specific data elements are identified by columns and rows
[Diagram: a grid with rows Row1 to Row4 and columns Col1 to Col3; each cell holds a data value]

The traditional relational model for RDBMS systems organizes data into a rectangular table format where data elements are identified by columns and rows. The SAS system also has a rectangular data storage model where data elements are identified by variables (columns) and observations (rows).

We will now begin a thread of discussion where we compare and contrast the relational structure of SAS tables and views with the traditional relational model behind relational databases.

Relational databases are designed primarily for Online Transaction Processing (OLTP) systems and are tuned to maximize the efficiency of data entry operations, which allow businesses to automate and record their business activities.
The SAS system can access, analyze, and present data from RDBMS systems, but is not itself a relational database. The SAS system is tuned to maximize the efficiency of data analysis and ad-hoc queries. We are a major player in the Decision Support, Online Analytical Processing (OLAP), Data Warehousing, Data Mining, Business Intelligence, and Applied Analysis markets, which have very different needs and priorities than the OLTP market.

Indexing
Age:          1            2         ...   n
Observation:  2,23,44,45   1,33,40   ...

An index is a method or object that provides the direct location in storage of a particular observation/row for a given variable/column value. Indexes are used to traverse relational tables with minimal reads, which improves efficiency and response time. For SAS tables and views, indexes are implemented using a global B-tree.

Definition: A B-tree is a balanced search tree. It is often confused with a binary tree, but a B-tree node may hold many keys and have more than two children. The simplified example below is drawn as a binary search tree, in which each node has 0, 1, or 2 children. Searches for particular data values traverse the tree from the root node down the branches until the desired node is reached. When the tree is balanced, this is much faster than a sequential search of the entire data file. Here is an example, based on a character field:

        Mary
       /    \
   Chuck    Pete
    /       /  \
  Ann   Nancy  Tom

SAS Tables: Features
• Row/Column structure
• Portable across SAS platforms
• ODBC compliant
• Extremely efficient
• Self-defining metadata
• Large file support > 2 GB (except VAX/VMS, Windows 95 and 3.2s)

Reference: “Open Systems Solutions to Large File Requirements,” Proceedings of the Twenty-First Annual SAS Users Group International Conference, pages 1437-1440.

For SAS 6.12, the solution to support files over 2 GB requires AIX 4.2, Solaris 2.6, or HP/UX 10.20. Solutions for other UNIX operating systems are outlined in Tom’s paper referenced above.
Alpha/VMS and Windows NT also support files over 2 GB, but VAX/VMS, Windows 95 and Windows 3.2s do not.

MVS and other mainframe systems do not have the 2 GB file size limitation in the operating system that is a challenge on other platforms. It is highly unlikely that SAS users will ever run out of disk space on MVS, since a SAS data file can have 2.147 billion blocks, with a block size of up to 1/2 track (30,000 bytes), which is roughly 64 terabytes. SAS data files can contain up to 2.147 billion observations, which could potentially limit the size of a SAS table on MVS.

SAS Data Warehouse
[Chart: SAS vs. RDBMS, compared on the features below]
• Flexible Diskspace Management
• SQL Support
• Index Support
• Views
• Data Compression
• Data Security
• No OLTP “Overhead”
• Access to archived Data

Reference: William D. Clifford, “Is the SAS System a Database Management System?” Proceedings of the Eighteenth Annual SAS Users Group International Conference, 1993, pages 168-173.

Definitions:
Flexible Diskspace Management: MVA technology to access remote data, and logical names (SAS libnames) to reference data paths instead of hard-coded path names.
SQL Support: The ANSI standard SQL language, the standard query language for relational databases, is supported.
Index Support: Indexes are supported, which improves performance.
Views: Views of relational tables are supported, which saves disk space.
Data Compression: Data can be compressed for storage efficiency.
Data Security: User access to data is restricted by access control lists, grants, roles, passwords and other means to prevent unauthorized access to sensitive data.
No OLTP Overhead: No performance hits from transactional operations for rollback and recovery.
Access to Archived Data: Data can be read directly from tape archives.

Summary
• SAS tables support a relational data model
• SAS tables have been designed for optimized performance of decision support operations like reporting and analysis of large volumes of data

SAS Tables: Positioning
• Flagship data repository of the SAS system for over 20 years
• Optimized solution for
– Decision Support applications
– OLAP
– Data Warehousing
– Data Mining
– Applied Analysis
– Business Intelligence

SAS/ACCESS Data Views
• Architecture
• Features
• Positioning

SAS Tables: Architecture
[Diagram labels: Data Views (View Definitions, Path to Data); Data Files (Data Definitions, SAS Data); external sources: DB2, Ingres, Oracle]

SAS Tables: Types
1) SAS Data Files: A SAS Data File is a SAS Table that stores descriptor information (data definitions) and observations (data values) in the same location.
2) SAS Data Views: A SAS Data View is a SAS Table in which the descriptor information (data definitions) and the observations (data values) are obtained from separate files. SAS Data Views store the information (metadata) required to retrieve data values that are stored in other files.

SAS Views are used to define subsets of data, which saves storage space because the subset is not stored a second time. SAS Views also reduce maintenance costs because changes made to the source data are automatically reflected in the view’s results. They can be created with the SQL or ACCESS procedures or with the DATA step language.

In this section we will focus on SAS/ACCESS views, which allow the SAS system to access data stored in relational, hierarchical, and network databases and other non-SAS data repositories.
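The behaviour of a data view described above (a stored definition whose results track the source data) can be sketched outside SAS. The following is a minimal illustration using Python's built-in sqlite3 module, not SAS syntax; the table name, view name, and rows are hypothetical and chosen only for the example.

```python
import sqlite3

# In-memory database standing in for a source data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (person TEXT, part TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Jones", "A307", 34), ("Smith", "A482", 27)],
)

# The view stores only its definition (a query), not a copy of the rows.
conn.execute(
    "CREATE VIEW jones_sales AS "
    "SELECT part, units FROM sales WHERE person = 'Jones'"
)

query = "SELECT part, units FROM jones_sales ORDER BY part"
before = conn.execute(query).fetchall()

# A change to the source table shows up in the view with no maintenance step.
conn.execute("INSERT INTO sales VALUES ('Jones', 'B32', 100)")
after = conn.execute(query).fetchall()

print(before)
print(after)
```

Because the view holds a query rather than data, the subset costs almost no storage, and the second result includes the newly inserted row without the view being rebuilt.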
Multiple Engine Architecture
[Diagram: Engines]

SAS/ACCESS Data Views: Features
• Leverages IT investment in DBMS technology
• Leverages database client/server technologies to supplement SAS solutions

SAS/ACCESS views provide access to OLTP and legacy data sources, including relational, hierarchical, and network databases as well as ODBC-compliant data stores or other DSS databases, such as Red Brick, which is a data store based on a star schema model instead of the traditional relational database model.

SAS/ACCESS views allow IT to leverage their investment in DBMS technology. The SAS system can access their existing data sources and does not require expensive and time-consuming data conversion processes.

SAS/ACCESS views leverage the client/server technologies of existing data sources to supplement our own solutions as part of our MVA and MEA strategies to access virtually all data sources in an organization, no matter where the data resides in terms of both the platform and the storage repository.

SAS/ACCESS Views: Positioning
“ . . . to make enterprise data, regardless of their source or structure, a generalized and available resource.”

SAS/MDDB Multidimensional Database
• Architecture
• Features
• Positioning

Tables vs. Multidimensional data model

Sales person   Date      Part no.   #     Price
Jones          12 Sept   A307       34    18.37
Smith          12 Sept   A482       27    21.04
Jones          13 Sept   B32        100   3.78
Henter         13 Sept   B32        250   3.78
Wilson         12 Sept   A307       29    18.37
Smith          12 Sept   B19        39    2.22

[Cube diagram: axes Geography, Time, Product; highlighted cell: Sales for East, Sept. & A307]

The basic enhancement that a focus on multidimensionality offers to client applications is access to data organized in a fashion that aligns with the way business end-users understand their enterprise.
This is accomplished by defining “dimensions” of the business, and then providing summaries of data within the definitions of those “dimensions”. This enables users to ask the more typical query, which involves a number of factors (e.g. geography, time, product, budget vs. actual, etc.), without being familiar with the underlying organization of the data.

Tables, or relational tables, in data warehouses are typically organized in a subject-oriented fashion, which should cut down on the number of times an end-user has to “join” tables together to answer a question. The use of SQL to manage retrieving data from tables is NOT an end-user task, and can actually be cumbersome, even for professional developers.

Dimensional Definitions
• Organization (Sub Division, Application)
• Platform (OS group, OS)
• Product (Product)
• Time (Year, Month)
• Geographic (Country, Sales Person)
• Scenarios (Actual, Budget, Variance, LY, LY Diff)

Just what are dimensions? Here I will borrow a few paragraphs from a very good description of multidimensionality in “Managing Multidimensional Data: Harnessing the Power”, Database Programming & Design, Volume 8, Number 8, August 1995, pages 24-33.

The data fundamental to all multidimensional analysis represents unique object or event instances, denoted by descriptive attributes. .... The dimensions typically fall into two categories: fact and descriptive. “Fact” dimensions represent some quantifiable or measurable aspect of the observed objects or events. Normally the fact dimension is represented by numeric, detailed individual values (such as units, dollars or temperature) for instance of the measured object or event. ... The other aspects of information collected about each event or object are used as “descriptive” dimensions. These dimensions are usually less granular; they have a smaller range of values (for example, color) and are used to order, group, or summarize the values of the fact dimensions.
Each dimension has a name and a set of elements, or values, that can be legitimately assigned to instances for that dimension.

So, dollar value of sale would be a “fact” dimension, while product name, region or time period would be “descriptive”.

SAS/MDDB server
[Diagram labels: Request; Subtable; Lookup table; Reach through; Base Table]

The SAS/MDDB server is a new multidimensional database: a specialized data storage facility (not a SAS table) into which data may be pulled from a data warehouse or other data sources and stored in a matrix-like format for fast and easy access by tools such as multidimensional viewers. The SAS/MDDB is a read-only data storage structure containing summarized information.

The SAS MDDB object follows a two-step methodology for providing multidimensional (OLAP-style) access to end-user applications. The first step is the creation of the MDDB N-Way crossing. This represents a "fact table" of the full list of crossings specified in the creation phase of the MDDB. Only levels with valid values are stored, thus addressing the "sparsity" problem in the first phase. This step has shown a significant reduction in the size of the data compared to the base table. Some of this reduction is due to subsetting the number of columns retained.

In the second step, any number of subtables can additionally be created from this N-Way crossing. These subtables are simply specific instances (and usually subsets) of the larger N-Way. Data warehouse designers will observe a significant improvement in performance, depending on how well they can match the subtables they create to anticipated application demands. The trade-off is the additional disk space needed to store the subtables.

The resulting MDDB is limited to a 2 GB file size, but the SAS input files are not subject to a 2 GB restriction.

Size Relationship: Tables vs. MDDB

An MDDB will generally be much smaller than the base table or tables it is created from.
The reasons for this include:
• atomic level detail not stored in MDDB
• storing calculations is optional
• summary records compressed where possible
• dimensional labels stored separate from tables, saving space on each record

The actual size reduction will vary depending on a variety of factors. A structure with many discrete dimensions will result in a larger MDDB than one with fewer. The number of analysis variables kept will affect size (more analysis variables, larger size). Whether calculations are stored can also have a significant impact on size. Calculations are often done more efficiently on the subset of interest to the client, and can be quickly accomplished on client machines. Storing pre-calculations can lead to much larger MDDB structures.

For performance purposes, subtables can also be specified. These higher level summarizations usually take up very little space relative to the central fact table, and are scanned first by client applications to see if they satisfy the requests being made.

MDDB/Server: Features
• Built from SAS tables or views
• Sparsely populated
• Reach-through to more detailed data if requested
• Look across different hardware platforms with remote library services

Sparsity:
1) The MDDB administrator has control over how many subtables to build
2) SAS/MDDB only stores cells with data
3) SAS/MDDB only stores up to 8 statistics for every analysis field; 13 other statistics are derived at run time. For example, average is calculated from sum and n.

SAS/MDDB: Positioning
• Fast loading
• Data reduction
• Fast retrieval and query processing
• Reach-through to source data

What is the benefit to customers? The SAS/MDDB data model for online analytical processing (OLAP) will enable better decision making by giving business users quick, unlimited views of multiple relationships for large quantities of summarized data.
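The ideas described for the SAS/MDDB (an N-way crossing that stores only cells with data, subtables rolled up from that crossing, and statistics such as the average derived at run time from sum and n) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MDDB storage format: the function names, the dictionary layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows from a base table: (region, month, part, units).
base_table = [
    ("East", "Sept", "A307", 34),
    ("East", "Sept", "A307", 29),
    ("East", "Sept", "A482", 27),
    ("West", "Sept", "B32", 100),
    ("West", "Sept", "B32", 250),
]


def build_crossing(rows, key_positions, value_position):
    """Summarize rows into cells keyed by dimension values.

    Only combinations that occur in the data get a cell, so empty
    crossings take no space. Each cell stores just n and sum.
    """
    cells = defaultdict(lambda: {"n": 0, "sum": 0.0})
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        cells[key]["n"] += 1
        cells[key]["sum"] += row[value_position]
    return dict(cells)


def rollup(cells, keep):
    """Build a subtable by re-aggregating cells over fewer dimensions."""
    out = defaultdict(lambda: {"n": 0, "sum": 0.0})
    for key, cell in cells.items():
        sub_key = tuple(key[i] for i in keep)
        out[sub_key]["n"] += cell["n"]
        out[sub_key]["sum"] += cell["sum"]
    return dict(out)


def mean(cell):
    """Derived at request time from the stored statistics."""
    return cell["sum"] / cell["n"]


# N-way crossing over region x month x part.
nway = build_crossing(base_table, (0, 1, 2), 3)

# Subtable by region only, computed from the crossing, not the base table.
by_region = rollup(nway, (0,))

print(len(nway), "cells stored")
print(mean(nway[("East", "Sept", "A307")]))
print(by_region)
```

With two regions, one month, and three parts there are six possible crossings, but only the three that occur in the data are stored. The subtable answers a region-level request without touching the larger crossing, at the cost of the extra space it occupies.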
SPDS Parallel Server
• Architecture
• Features
• Positioning

Introduction to SPDS
Scalable Performance Data Server:
• A high performance multi-user server designed for storing and retrieving data
• A complete C/S solution - fully scalable
• A data repository optimized for Data Warehousing
• Fine Granularity Security

Typical Client/Server configuration:
SAS or ODBC Client on Windows 95
SPDS Server on SUN SPARC 1000, 4 CPU node running Solaris 2.5

Currently, the SPDS server is available on Solaris 2.5 platforms. The server will be available on other UNIX server platforms in a future release. The aspect of Solaris 2.5 used by the SPDS server that is not directly portable to other UNIX operating systems is the use of Solaris lightweight threads.

For more information on SPDS, see the PRODUCT POSITIONING on INFOSITE at URL: http://www.unx.sas.com/mkt/smw/products/spdssrv/position/overview.html

SPDS Table Component Files Architecture
[Diagram: an SPDS table and its index(es) are stored as component files: Table Metadata, Table Data, Index Data, Ranking Data]

SPDS data storage structures are NOT SAS tables, but these structures can be MAPPED to SAS tables, and as such are accessible to the SAS system 4GL and most of the diverse SAS user interfaces. It is also important to note that the SPDS data store is open via ODBC and is accessible to non-SAS client software.

SPDS storage detailed information:
1. SAS tables cannot operate at the PAGE level of data access; SPDS can.
2. The TABLE DATA in the diagram above is stored as SEGMENTS, not PAGES.
3. The INDEX DATA in the diagram above can be stored either as a segmented bitmap index or as a B-tree.
4. The RANKING DATA in the diagram above is a global B-tree (the same as the global B-tree index for SAS tables), and is used primarily for SAS “BY STATEMENT” processing. Hint: Sort data first for better performance.
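The bitmap index option mentioned for SPDS index data can be illustrated with a generic sketch. This shows the general technique only: it is not the SPDS on-disk format, it does not model segmentation, and the function names and sample columns are hypothetical. Each distinct value gets one bitmap in which bit i is set when row i holds that value, so a query on several columns reduces to a bitwise AND.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer used as a bitmap of row numbers."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Two hypothetical columns of the same five-row table.
region = ["East", "West", "East", "East", "West"]
part = ["A307", "B32", "A482", "A307", "B32"]

region_idx = build_bitmap_index(region)
part_idx = build_bitmap_index(part)

# WHERE region = 'East' AND part = 'A307' becomes one AND of two bitmaps.
matches = rows_of(region_idx["East"] & part_idx["A307"])
print(matches)
```

Bitmap indexes of this kind tend to suit warehouse columns with few distinct values, where combining several conditions is common; a B-tree remains the usual choice for columns with many distinct values.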
SPDS: Features
• Plug-Compatibility with SAS (libname)
• Parallel queries
• Parallel index creation
• Parallel append
• Parallel sorts
• Advanced Bitmap indexing
• Large files - over 2 GB
• Libname Domain Name Server
• Centralized Security - ACLs

Plug-compatibility with SAS as SPDS client:
– Uses Libname engine interface
– Supports SAS data files, indexes, catalogs, & utility files
– PROC COPY/APPEND or DATA step for loading
– Member-level locking
– UNIX utilities for moving, backup/restore

Note: SPDS Version 2 will add full SQL pass-through engine support, so that SQL joins are processed at the SPDS server rather than at the SAS or ODBC client.

The SPDS name server maps administrative information into a domain of SAS tables and provides transparent mapping of LIBNAME definitions to specific SPDS data stores.

Access to SPDS data is granted through Access Control Lists (ACLs) for the following permissions: ALTER, READ, WRITE, CONTROL. The default is no access for users other than the data owner.

SPDS: Positioning
• Performance gains in data access
• Concurrent user access & member level locking
• Compatible with existing SAS programs (libname)
• Optimized for Data Warehousing (reading large volumes of data)
• ODBC client applications
• Network optimization
SAS Data Storage/Target Application Chart - RAMP
[Chart: rows Relational Tables, ACCESS Views, MDDB, Parallel (SPDS); columns DW, DM, OLAP, DSS]
Legend: DW = DATA WAREHOUSING, DM = DATA MINING, DSS = DECISION SUPPORT SYSTEMS, OLAP = ONLINE ANALYTICAL PROCESSING

The major markets for the SAS System are Decision Support Applications, Data Warehousing, Applied Analysis, Data Mining, Business Intelligence, and OLAP (Online Analytical Processing). SAS does not play in the OLTP (Online Transaction Processing) market, which is dominated by the relational database vendors.

SAS is not optimized for transaction processing operations, and we recommend that users not run SAS analytical and reporting tools directly against their transactional databases while data entry operations are in progress. Since OLTP data is continuously being updated, a report reflects only a snapshot of the data at the moment it is run. The same report run a few minutes later will show different results, based on that particular time slice. This phenomenon is known in the industry as the “twinkling data” problem.

Instead, we recommend that batch programs be run to load and periodically refresh a SAS data repository, which provides a stable data store for the reporting and analysis needs of business users. The SAS/Warehouse Administrator is our product offering to coordinate these activities, as well as data cleansing, transformation, and summarization operations. We can also easily plug into existing data warehouses.

If users want to run reports directly against ACCESS views to OLTP data stores, we recommend nightly reports when data entry operations are not in progress.

Time for questions ...
How about your storage RAMP?
Thank you very much for your attention!