RAMP up your storage options
Strategies for Warehouse Repositories
Dr. Thomas Becker
SAS Institute
1
SAS System
Background
• A major role of the SAS System has been, and
continues to be:
– Accessing difficult data structures, transparently
– Manipulating data through a powerful and easy
to use 4th generation programming language
– Organizing data for business reporting and
statistical analysis
In this presentation we will be exploring SAS data stores within the
broader context of the strengths of the SAS System and our core
technologies, and the role SAS plays in the software marketplace.
2
SAS Data Storage
Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server
Our data storage strategy for the SAS System is RAMP - Relational,
Access, MDDB, and Parallel.
The RAMP acronym for SAS data storage repositories comes from the
article “Orlando II ramps up storage options,” SAS Communications,
A Quarterly Magazine for European SAS Software Users, 4Q ‘96,
pages 22-23.
A white paper on SAS data storage repositories is also being written
by Rick Evans in our European office, and will be available to you
soon as an additional resource.
3
SAS Data Storage
Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server
4
SAS Tables and Views
• Architecture
• Features
• Positioning
5
SAS Tables: Architecture
[Diagram: a SAS table consists of descriptor information, data values
arranged by observation (row) and variable (column), and indexes]
SAS Tables: Definition
A SAS table is a collection of data values and their associated
descriptive information arranged and presented in a form that can be
recognized and processed by the SAS system.
SAS Tables: Components
• Data Values organized in a rectangular structure of columns and
rows
• Descriptor Information that identifies attributes of both the table
and data values, providing the most basic level of metadata in the
SAS system and making the table a self-documenting object
• Indexes to enable the SAS system to quickly locate the
observations (rows of data) associated with a particular value or range
of data values for a key variable (column). (Implementation note:
Global B-tree of value/row identifier pairs which enables the SAS
engine to search by data values quickly).
An alternative to the pass-through mode is using SAS views and
SAS tables.
6
SAS Data Library: High Level Architecture
[Diagram: a SAS Data Library contains SAS Catalogs, SAS Tables (SAS
Data Files and SAS Data Views), SAS Stored Programs, SAS Access
Descriptors, and Other SAS Files]
Briefly, let’s take a bird’s-eye view of the SAS data storage
architecture and see where SAS tables and views fit into the larger
scheme.
7
Relational Data Model
• Data is organized in tables
• Specific data elements are identified by
columns and rows
[Diagram: a grid of columns Col1 to Col3 and rows Row1 to Row4, with a
data value at the intersection of a row and a column]
The traditional relational model for RDBMS systems organizes data
into a rectangular table format where data elements are identified by
columns and rows. The SAS system also has a rectangular data
storage model where data elements are identified by variables
(columns) and observations (rows).
We will now begin a thread of discussion where we compare and
contrast the relational structure of SAS tables and views with the
traditional relational model behind relational databases.
Relational databases are designed primarily for Online Transaction
Processing (OLTP) systems and are tuned to maximize the efficiency
of data entry operations, which allow businesses to automate and
record their business activities. The SAS system can access, analyze,
and present data from RDBMS systems, but is not itself a relational
database.
The SAS system is tuned to maximize the efficiency of data analysis
and ad-hoc queries. We are a major player in the Decision Support,
Online Analytical Processing (OLAP), Data Warehousing, Data Mining,
Business Intelligence, and Applied Analysis markets, which have very
different needs and priorities than the OLTP market.
8
Indexing
Age:    Observation:
1       2,23,44,45
2       1,33,40
...     ...
n
An index is a method or object that provides the direct
location in storage of a particular observation/row for a
given variable/column value. Indexes are used to
traverse relational tables with minimal reads. This
process increases efficiency and response time. For
SAS tables and views, indexes are implemented using a
global B-tree.
Definition: A B-tree is a balanced search tree. It is often confused
with a binary tree, but a B-tree node can hold many keys and have
many children, which keeps the tree shallow. A B-tree is a hierarchy
of index relationships that starts with a root node, with branches to
child nodes and, at the bottom, leaf nodes, which have no children.
Searches for particular data values traverse the B-tree paths from
the root node down the branches until the desired node is reached.
This is much faster than a sequential search of the entire data file.
Here is a simplified example with at most two children per node,
based on a character field:
          Mary
         /    \
     Chuck     Pete
     /        /    \
   Ann     Nancy    Tom
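The lookup path through the example tree can be sketched in a few lines of Python. This is only an illustration of the search idea on the simplified two-children tree from the slide, not the SAS index implementation; the `Node` class, the `insert` and `search` helpers, and the row identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    key: str                        # indexed value, here a name
    row: int                        # hypothetical row identifier for that value
    left: Optional["Node"] = None   # subtree with smaller keys
    right: Optional["Node"] = None  # subtree with larger or equal keys


def insert(root: Optional[Node], key: str, row: int) -> Node:
    """Insert a key/row pair, keeping smaller keys on the left."""
    if root is None:
        return Node(key, row)
    if key < root.key:
        root.left = insert(root.left, key, row)
    else:
        root.right = insert(root.right, key, row)
    return root


def search(root: Optional[Node], key: str) -> Tuple[Optional[int], int]:
    """Return (row, nodes_visited); row is None when the key is absent."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.row, visited
        node = node.left if key < node.key else node.right
    return None, visited


# Insertion order chosen so the tree matches the slide, with Mary at the root.
names = ["Mary", "Chuck", "Pete", "Ann", "Nancy", "Tom"]
root = None
for row_id, name in enumerate(names, start=1):
    root = insert(root, name, row_id)

row, visited = search(root, "Nancy")
print(row, visited)
```

Finding Nancy touches three nodes (Mary, Pete, Nancy), whereas a sequential scan of the six names in the order above would read five of them first.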
9
SAS Tables: Features
• Row/Column structure
• Portable across SAS platforms
• ODBC compliant
• Extremely efficient
• Self-defining metadata
• Large file support > 2 GB
  (except VAX/VMS, Windows 95 and 3.2s)
Reference:
“Open Systems Solutions to Large File
Requirements,” Proceedings of the
Twenty-First Annual SAS Users Group
International Conference, pages 1437-1440.
For SAS 6.12, the solution to support files over 2 GB requires AIX 4.2,
Solaris 2.6, or HP-UX 10.20. Other solutions are outlined in Tom’s
paper referenced above for other UNIX operating systems.
Alpha/VMS and Windows NT also support files over 2 GB, but VAX/VMS,
Windows 95 and Windows 3.2s do not.
MVS and other mainframe systems do not have the 2 GB file size
limitation in the operating system that is a challenge for other
platforms.
It is highly unlikely that SAS users will ever reach the maximum file
size on MVS, since a SAS data file can have 2.147 billion blocks, with
a block size up to 1/2 track (30,000 bytes). SAS data files can contain
up to 2.147 billion observations, which could potentially limit the
size of a SAS table on MVS.
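To put those limits in perspective, the arithmetic can be worked through in a short Python sketch. It assumes that "2.147 billion" refers to the signed 32-bit maximum (2^31 - 1), which the notes do not state explicitly, and the 100-byte row length is a purely hypothetical figure.

```python
# Assumption: "2.147 billion" is the signed 32-bit maximum, 2**31 - 1.
MAX_BLOCKS = 2**31 - 1
MAX_OBSERVATIONS = 2**31 - 1
MAX_BLOCK_SIZE = 30_000          # bytes, the half-track block size quoted above

# Upper bound set by the block count.
max_bytes = MAX_BLOCKS * MAX_BLOCK_SIZE
max_terabytes = max_bytes / 10**12        # decimal terabytes

# Upper bound set by the observation count for a hypothetical 100-byte row.
ROW_LENGTH = 100                 # bytes per observation (illustrative only)
rows_limit_bytes = MAX_OBSERVATIONS * ROW_LENGTH
rows_limit_gigabytes = rows_limit_bytes / 10**9

print(f"block limit: {max_bytes:,} bytes, about {max_terabytes:.1f} TB")
print(f"row limit at {ROW_LENGTH} bytes/row: about {rows_limit_gigabytes:.1f} GB")
```

Under these assumptions the block count allows roughly 64 TB, while a table of short rows would reach the observation limit long before that, which is why the observation count is the more likely constraint.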
10
SAS Data Warehouse
[Chart comparing SAS and RDBMS on the following features]
• Flexible Diskspace Management
• SQL Support
• Index Support
• Views
• Data Compression
• Data Security
• No OLTP “Overhead”
• Access to archived Data
Reference:
William D. Clifford, “Is the SAS System a Database Management
System?” Proceedings of the Eighteenth Annual SAS Users Group
International Conference, 1993, pages 168-173.
Definitions:
Flexible Diskspace management: MVA technology to access remote
data, and logical names (SAS libnames) to reference data paths
instead of hard-coded path names.
SQL Support: The ANSI standard SQL language, which is the standard
query language for relational databases, is supported.
Index Support: Indexes are supported, which improves performance.
Views: Views of relational tables are supported, which saves disk
space.
Data compression: Data can be compressed for storage efficiency.
Data security: User access to data is restricted by access control
lists, grants, roles, passwords and other means to prevent
unauthorized access to sensitive data.
No OLTP Overhead: No performance hits from transactional operations
for rollback and recovery.
Access to Archived Data: Data can be read directly from tape archives.
11
Summary
• SAS tables support a relational data model
• SAS tables have been designed for optimized
performance of decision support operations
like reporting and analysis of large volumes of
data
12
SAS Tables: Positioning
• Flagship data repository of the SAS system
for over 20 years
• Optimized solution for
– Decision Support applications
– OLAP
– Data Warehousing
– Data Mining
– Applied Analysis
– Business Intelligence
13
SAS Data Storage
Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server
14
SAS/ACCESS Data Views
• Architecture
• Features
• Positioning
15
SAS Tables: Architecture
[Diagram: Data Files hold data definitions together with SAS data;
Data Views hold view definitions and a path to data stored elsewhere,
such as SAS data, DB2, Ingres, or Oracle]
SAS Tables: Types
1) SAS Data Files:
A SAS Data File is a SAS Table that stores descriptor
information (data definitions) and observations (data values) in
the same location.
2) SAS Data Views:
A SAS Data View is a SAS Table in which the descriptor
information (data definitions) and the observations (data values)
are obtained from separate files. SAS Data Views store the
information (metadata) required to retrieve data values that are
stored in other files.
SAS Views are used to define subsets of data, which saves
storage space. SAS Views also reduce maintenance costs because
changes made to the source data are automatically reflected in the
view’s results. Views can be created with the SQL or ACCESS
procedures or with the DATA step language.
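The way a view tracks its source data can be illustrated with a small sketch. It uses SQLite through Python's standard `sqlite3` module as a stand-in, not SAS; the `sales` table and `a307_sales` view are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (person TEXT, part TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Jones", "A307", 34), ("Smith", "A482", 27)],
)

# The view stores only its definition (a query), not a copy of the rows.
conn.execute(
    "CREATE VIEW a307_sales AS "
    "SELECT person, units FROM sales WHERE part = 'A307'"
)

query = "SELECT person, units FROM a307_sales ORDER BY person"
before = conn.execute(query).fetchall()

# A change to the source table shows up in the view with no extra maintenance.
conn.execute("INSERT INTO sales VALUES ('Wilson', 'A307', 29)")
after = conn.execute(query).fetchall()

print(before)
print(after)
```

The second query returns the new Wilson row even though the view itself was never touched, which is the maintenance saving described above.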
In this section we will focus on SAS/ACCESS views which allow
the SAS system to access data stored in relational, hierarchical, and
network databases and other non-SAS data repositories.
16
Multiple Engine
Architecture
Engines
17
SAS/ACCESS Data
Views: Features
• Leverages IT investment in DBMS technology
• Leverages database client/server
technologies to supplement SAS solutions
SAS/ACCESS views provide access to OLTP and legacy data
sources, including relational, hierarchical, and network databases as
well as ODBC-compliant data stores or other DSS databases, such as
Redbrick, which is a data store based on a star schema model instead
of the traditional relational database model.
SAS/ACCESS views allow IT to leverage their investment in DBMS
technology. The SAS system can access their existing data sources
and does not require expensive and time-consuming data conversion
processes.
SAS/ACCESS views leverage the client/server technologies of existing
data sources to supplement our own solutions as part of our MVA and
MEA strategies to access virtually all data sources in an organization
no matter where the data resides in terms of both the platform and
storage repository.
18
SAS/ACCESS Views:
Positioning
“ . . . to make enterprise data,
regardless of their source or
structure, a generalized and
available resource.”
19
SAS Data Storage
Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server
20
SAS/MDDB Multidimensional Database
• Architecture
• Features
• Positioning
21
Tables vs. Multidimensional data model
Sales person   Date      Part no.   #     Price
Jones          12 Sept   A307       34    18.37
Smith          12 Sept   A482       27    21.04
Jones          13 Sept   B32        100   3.78
Henter         13 Sept   B32        250   3.78
Wilson         12 Sept   A307       29    18.37
Smith          12 Sept   B19        39    2.22
[Diagram: a cube with axes Geography, Time, and Product; one cell is
labeled "Sales for East, Sept. & A307"]
The basic enhancement a focus on multidimensionality offers to client
applications is access to data, organized in a fashion that aligns with
the way business end-users understand their enterprise. This is
accomplished by defining “dimensions” of the business, and then
providing summaries of data within the definitions of those
“dimensions”. This enables users to ask the more typical query which
involves a number of factors (e.g. geography, time, product, budget
vs. actual, etc.) without being familiar with the underlying organization
of the data.
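As a rough illustration of summarizing by dimension, the Python sketch below totals the rows from the slide's table for whichever dimensions the user picks. It assumes the "#" column is units sold; the `summarize` helper and the `COLUMNS` tuple are hypothetical and not part of any SAS product.

```python
from collections import defaultdict

# Rows from the slide: (sales person, date, part no., units, price).
# Assumption: the "#" column on the slide is the number of units sold.
COLUMNS = ("person", "date", "part", "units", "price")
rows = [
    ("Jones", "12 Sept", "A307", 34, 18.37),
    ("Smith", "12 Sept", "A482", 27, 21.04),
    ("Jones", "13 Sept", "B32", 100, 3.78),
    ("Henter", "13 Sept", "B32", 250, 3.78),
    ("Wilson", "12 Sept", "A307", 29, 18.37),
    ("Smith", "12 Sept", "B19", 39, 2.22),
]


def summarize(table, dimensions):
    """Total units for every combination of the chosen dimensions."""
    positions = [COLUMNS.index(name) for name in dimensions]
    units_at = COLUMNS.index("units")
    totals = defaultdict(int)
    for record in table:
        key = tuple(record[i] for i in positions)
        totals[key] += record[units_at]
    return dict(totals)


by_date_part = summarize(rows, ("date", "part"))
by_person = summarize(rows, ("person",))

print(by_date_part[("12 Sept", "A307")])
print(by_person[("Jones",)])
```

The caller names business dimensions ("date", "part") rather than writing joins or knowing how the rows are stored, which is the point the paragraph above makes.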
Tables, or relational tables, in data warehouses are typically
organized in a subject-oriented fashion, which should cut down on the
number of times an end-user has to “join” tables together to answer a
question. The use of SQL to manage retrieving data from tables is
NOT an end-user task, and can actually be cumbersome, even for
professional developers.
22
Dimensional Definitions
Organization (Sub Division, Application)
Platform (OS group, OS)
Product (Product)
Time (Year, Month)
Geographic (Country, Sales Person)
Scenarios (Actual, Budget, Variance, LY, LY Diff)
Just what are dimensions? Here I will borrow a few paragraphs from a
very good description of multidimensionality in “Managing
Multidimensional Data: Harnessing the Power”, Database and
Programming Design, Volume 8, Number 8, August 1995, pages 24-33.
The data fundamental to all multidimensional analysis represents
unique object or event instances, denoted by descriptive attributes. ....
The dimensions typically fall into two categories: fact and descriptive.
“Fact” dimensions represent some quantifiable or measurable aspect
of the observed objects or events. Normally the fact dimension is
represented by numeric, detailed individual values (such as units,
dollars or temperature) for instance of the measured object or event.
...
The other aspects of information collected about each event or object
are used as “descriptive” dimensions. These dimensions are usually
less granular; they have a smaller range of values (for example, color)
and are used to order, group, or summarize the values of the fact
dimensions. Each dimension has a name and a set of elements, or
values, that can be legitimately assigned to instances for that
dimension.
So, dollar value of sale would be a “fact” dimension, while product
name, region or time period would be “descriptive”.
23
SAS/MDDB server
[Diagram labels: Request, Subtable, Lookup table, Reach through,
Base Table]
The SAS/MDDB server is a new multidimensional database which is a
specialized data storage facility (not a SAS table), where data may be
pulled from a data warehouse or other data sources for storage in a
matrix-like format for fast and easy access by tools such as
multidimensional viewers. The SAS/MDDB is a read-only data
storage structure containing summarized information.
The SAS MDDB object follows a two-step methodology for providing
multidimensional (OLAP-style) access to end-user applications. The
first step is the creation of the MDDB N-Way crossing. This
represents a "fact table" of the full list of crossings specified in the
creation phase of the MDDB. Only levels with valid values are stored,
thus addressing the "sparsity" problem in the first phase. This step
has shown significant reduction in size of data as compared to the
target base table. Some of this reduction is due to subsetting the
number of columns retained. In the second step, any number of
subtables can be created from this N-Way crossing. These subtables are
simply specific instances (and usually subsets) of the larger N-Way.
Data warehouse designers will observe significant improvement in
performance depending on how well they can match anticipated
application demands with the subtables they create. The trade-off is
additional disk space to store the subtables. The resulting
MDDB is limited to a 2 GB file size, but the SAS input files are not
subject to a 2 GB restriction.
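The two steps can be sketched in Python as follows. This is a toy model under stated assumptions, not the MDDB storage format: the detail rows, the `build_crossing` and `build_subtable` helpers, and the choice to keep a sum and a count per cell are all hypothetical.

```python
# Hypothetical detail rows: (region, month, product, sales amount).
detail = [
    ("East", "Sept", "A307", 600),
    ("East", "Sept", "A307", 500),
    ("East", "Sept", "A482", 550),
    ("West", "Sept", "B32", 400),
    ("West", "Oct", "B32", 900),
]


def build_crossing(rows, keep):
    """Step 1: one cell per combination of key columns that actually occurs.

    Each cell holds [sum, n]. Combinations with no data are never stored.
    """
    cells = {}
    for row in rows:
        key = tuple(row[i] for i in keep)
        cell = cells.setdefault(key, [0, 0])
        cell[0] += row[-1]
        cell[1] += 1
    return cells


def build_subtable(nway, keep):
    """Step 2: collapse the N-way crossing onto a subset of its dimensions."""
    sub = {}
    for key, (total, n) in nway.items():
        sub_key = tuple(key[i] for i in keep)
        cell = sub.setdefault(sub_key, [0, 0])
        cell[0] += total
        cell[1] += n
    return sub


nway = build_crossing(detail, keep=(0, 1, 2))

# Size of the full cross product of levels, for comparison with what is stored.
possible_cells = 1
for position in range(3):
    possible_cells *= len({row[position] for row in detail})

by_region = build_subtable(nway, keep=(0,))

print(len(nway), "cells stored out of", possible_cells, "possible")
print(by_region)
```

Here only 4 of the 12 possible region/month/product combinations are stored, and the by-region subtable is built from the crossing rather than from the detail rows, at the cost of holding a second, smaller table.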
24
Size Relationship: Tables vs. MDDB
[Diagram: relative size of Tables vs. MDDB]
An MDDB will generally be much smaller than the base table or tables
it is created from. The reasons for this include:
• atomic level detail not stored in MDDB
• storing calculations is optional
• summary records compressed where possible
• dimensional labels stored separate from tables, saving space on each
record
The actual size reduction will vary depending on a variety of factors.
For example, a structure with many discrete dimensions will result in
a larger MDDB than one with fewer. The number of analysis variables
kept will affect size (more analysis variables, larger size). Whether
calculations are stored can also have a significant impact on size.
Calculations are often done more efficiently on the subset of interest
to the client, and can be quickly accomplished on client machines.
Storing pre-calculations can lead to much larger MDDB structures.
For performance purposes, subtables can also be specified. These
higher level summarizations usually take up very little space relative to
the central fact table, and are scanned first by client applications, to
see if they satisfy the requests being made.
25
MDDB/Server: Features
• Built from SAS tables or views
• Sparsely populated
• Reach-through to more detailed data if
requested
• Look across different hardware
platforms with remote library services
Sparsity:
1) The MDDB administrator has control over how many subtables to build
2) SAS/MDDB only stores cells with data
3) SAS/MDDB only stores up to 8 statistics for every analysis field; 13
other statistics are derived at run time. For example, average is
calculated from sum and n.
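The idea of deriving statistics at run time can be sketched as below. The notes only confirm that average comes from sum and n; the particular set of stored statistics here (n, sum, uncorrected sum of squares, min, max) and the helper names are assumptions made for illustration, not the product's actual list.

```python
def stored_stats(values):
    """Statistics kept per cell in this sketch (an assumed set)."""
    return {
        "n": len(values),
        "sum": sum(values),
        "uss": sum(v * v for v in values),   # uncorrected sum of squares
        "min": min(values),
        "max": max(values),
    }


def derived_stats(cell):
    """Statistics computed on request from the stored ones."""
    n = cell["n"]
    avg = cell["sum"] / n
    # Sample variance from the uncorrected sum of squares; undefined for n = 1.
    var = (cell["uss"] - n * avg * avg) / (n - 1) if n > 1 else None
    return {"avg": avg, "range": cell["max"] - cell["min"], "var": var}


cell = stored_stats([34, 29, 27])
derived = derived_stats(cell)
print(cell)
print(derived)
```

Because the stored values are simple totals and extremes, two cells can be combined by adding or comparing their entries, and the derived statistics never need to be stored.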
26
SAS/MDDB: Positioning
• Fast loading
• Data reduction
• Fast retrieval and query processing
• Reach-through to source data
What is the benefit to customers?
The SAS/MDDB data model for online analytical
processing (OLAP) will enable better decision making
by giving business users quick, unlimited views of
multiple relationships for large quantities of summarized
data.
27
SAS Data Storage
Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server
28
SPDS Parallel Server
• Architecture
• Features
• Positioning
29
Introduction to SPDS
Scalable Performance Data Server:
• A high performance multi-user server designed
for storing and retrieving data
• A complete client/server solution - fully scalable
• A data repository optimized for Data
Warehousing
• Fine Granularity Security
Typical Client/Server configuration:
SAS or ODBC Client on Windows 95
SPDS Server on SUN SPARC 1000, 4 CPU node running Solaris 2.5
Currently, the SPDS server is available on Solaris 2.5 platforms. The
server will be available on other UNIX server platforms in a future
release. The aspect of Solaris 2.5 used by the SPDS server that is not
directly portable to other UNIX operating systems is the use of Solaris
lightweight threads.
For more information on SPDS, see the PRODUCT POSITIONING
on INFOSITE at URL:
http://www.unx.sas.com/mkt/smw/products/spdssrv/position/overview.html
30
SPDS Table Component Files Architecture
[Diagram: an SPDS table and its index(es) are stored as component
files: Table Metadata, Table Data, Index Data, and Ranking Data]
SPDS data storage structures are NOT SAS tables, but these
structures can be MAPPED to SAS tables, and as such are
accessible to the SAS system 4GL and most of the diverse SAS user
interfaces. It is also important to note that the SPDS data store is
open via ODBC and is accessible to non-SAS client software.
SPDS storage detailed information:
1. SAS tables cannot operate at the PAGE level of data access;
SPDS can.
2. The TABLE DATA in the diagram above is stored as SEGMENTS,
not PAGES.
3. The INDEX DATA in the diagram above can be stored either as
a segmented bitmap index or as a B-tree.
4. The RANKING DATA in the diagram above is a global B-tree (same
as the global B-tree index for SAS tables), and is used primarily for
SAS BY statement processing. Hint: Sort data first for better
performance.
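For readers unfamiliar with bitmap indexes, here is a minimal Python sketch of the general technique. It is a plain, unsegmented bitmap index built on Python integers and says nothing about how SPDS actually lays out its segmented bitmaps; the column data and helper names are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical columns of a six-row table.
region = ["East", "West", "East", "East", "West", "East"]
part = ["A307", "A307", "B32", "A307", "B19", "A482"]

region_idx = build_bitmap_index(region)
part_idx = build_bitmap_index(part)

# WHERE region = 'East' AND part = 'A307' becomes a single bitwise AND.
both = region_idx["East"] & part_idx["A307"]
# WHERE region = 'East' OR part = 'B19' becomes a single bitwise OR.
either = region_idx["East"] | part_idx["B19"]

print(rows_of(both))
print(rows_of(either))
```

Combining predicates with bitwise operations before touching any rows is what makes bitmap indexes attractive for the read-heavy, multi-condition queries typical of a data warehouse.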
31
SPDS: Features
• Plug-Compatibility with SAS (libname)
• Parallel queries
• Parallel index creation
• Parallel append
• Parallel sorts
• Advanced Bitmap indexing
• Large files - over 2 GB
• Libname Domain Name Server
• Centralized Security - ACL's
Plug-compatibility with SAS as SPDS client:
Uses Libname engine interface
Supports SAS data files, indexes, catalogs, & utility files
PROC COPY/APPEND or DATA step for loading
Member-level locking
UNIX utilities for moving, backup/restore
Note: Coming with SPDS Version 2 is full SQL Pass-Through engine
support, which processes SQL joins at the SPDS server rather than at
the SAS or ODBC client.
The SPDS name server maps administrative information into a domain
of SAS tables and provides transparent mapping of LIBNAME
definitions to specific SPDS data stores.
Data access to SPDS data is granted through Access Control Lists
(ACL’s) for the following permissions: ALTER, READ, WRITE,
CONTROL. Default is no access to users other than the data owner.
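The default-deny behavior of the ACLs can be sketched in Python. Only the four permission names and the "no access except for the owner" default come from the notes above; the `Resource` class, its methods, the user names, and the assumption that the owner implicitly holds every permission are illustrative.

```python
PERMISSIONS = {"ALTER", "READ", "WRITE", "CONTROL"}


class Resource:
    """A data store protected by an access control list (illustrative only)."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.acl = {}            # user -> set of granted permissions

    def grant(self, user, *perms):
        """Add permissions for a user, rejecting names outside the four above."""
        unknown = set(perms) - PERMISSIONS
        if unknown:
            raise ValueError(f"unknown permissions: {sorted(unknown)}")
        self.acl.setdefault(user, set()).update(perms)

    def allowed(self, user, perm):
        """Owner is assumed to hold every permission; others need a grant."""
        if user == self.owner:
            return True
        return perm in self.acl.get(user, set())


table = Resource("sales", owner="alice")
table.grant("bob", "READ")

print(table.allowed("alice", "ALTER"))   # owner
print(table.allowed("bob", "READ"))      # explicitly granted
print(table.allowed("bob", "WRITE"))     # not granted
print(table.allowed("carol", "READ"))    # no ACL entry at all
```

A user with no ACL entry gets nothing, so access has to be granted deliberately rather than revoked after the fact.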
32
SPDS: Positioning
• Performance gains in data access
• Concurrent user access & member level
locking
• Compatible with existing SAS programs
(libname)
• Optimized for Data Warehousing (reading
large volumes of data)
• ODBC client applications
• Network optimization
33
SAS Data Storage
Strategy: RAMP
• Relational: SAS Tables and Views
• Access: SAS/ACCESS Views
• MDDB: SAS/MDDB
• Parallel: SPDS Parallel Server
34
SAS Data Storage/Target Application Chart - RAMP
                     DW    DM    OLAP    DSS
Relational Tables
ACCESS Views
MDDB
Parallel (SPDS)
Legend: DW = DATA WAREHOUSING, DM = DATA MINING,
DSS = DECISION SUPPORT SYSTEMS,
OLAP = ONLINE ANALYTICAL PROCESSING
The major markets for the SAS System are Decision Support
Applications, Data Warehousing, Applied Analysis, Data Mining,
Business Intelligence, and OLAP (Online Analytical Processing).
SAS does not play in the OLTP (Online Transaction Processing)
market, which is dominated by the relational database vendors. SAS
is not optimized for transaction processing operations, and we
recommend that users not run SAS analytical and reporting tools
directly against their transactional databases while data entry
operations are in progress. Since OLTP transactions are continuously
being updated, reports reflect the current snapshot of data at that
particular moment in time. The same report run a few minutes later
will reflect different results, based on that particular time slice. This
phenomenon is known in the industry as the “twinkling data” problem.
Instead, we recommend that batch programs be run to load and
periodically refresh a SAS data repository which provides a stable data
store for the reporting and analysis needs of business users. The
SAS/Warehouse Administrator is our product offering to coordinate
these activities, as well as data cleansing, transformation, and
summarization operations. We can also easily plug into existing data
warehouses.
If users want to run reports directly against ACCESS views to
OLTP data stores, we recommend nightly reports when data entry
operations are not in progress.
35
Time for questions ...
How about your storage RAMP?
Thank you very much for your attention!
36