Building a Data Warehouse with SAS® Software
in the Unix® Environment
Karen Grippo, Dun & Bradstreet, Basking Ridge, NJ
John Chen, Dun & Bradstreet, Basking Ridge, NJ
Lisa Brown, SAS Institute Inc., Cary, NC
ABSTRACT

A common trend in corporations today is "downsizing" - moving applications from expensive mainframes to more affordable mid-range platforms. The Information Center of Dun & Bradstreet Information Systems, faced with the challenge of moving fulfillment off the mainframe, has chosen SAS software not only as the vehicle for its data analysis and delivery but also as its database management system.

Due to the diversity of projects and data requirements, some time and analysis was required to identify an application suitable for migration. Once the application and the supporting data elements were chosen, a database management system had to be selected. Several key requirements were identified: (1) The time required to load the database had to be minimized. Since commitments to customers would be at stake, a solution requiring more than one weekend to load the data would not be acceptable, and less than twelve hours was preferable, in the event that the data had to be reloaded overnight. (2) The majority of projects involved appending Dun & Bradstreet data to files with fewer than 50,000 records, using the DUNS Number® as the primary key for retrieving data (the DUNS Number is a unique number, like the Social Security number, that is assigned to every company in the database). (3) The time required to implement the database and transition the group from the mainframe to the Unix platform should be as short as possible.

Twenty gigabytes, extracted from mainframe legacy databases, have been loaded onto an HP-UX® platform as indexed SAS data files. Two data structures (a table mapping the data files to file systems and a data dictionary mapping data elements to data files) and a set of SAS macros have been developed to give the Information Center developers and business users transparent access to the data, freeing them from the need to know data file or data element locations.

In a future release of the SAS System, SAS Institute plans to support an integrated data dictionary. The Version 6 application presented in this paper had to define its own dictionary component. A post-Version 6 preview of several dictionary features will be discussed at the end of this paper.

THE CHALLENGE

Like many corporations today, Dun & Bradstreet Information Systems is trying to reduce costs by moving applications from the mainframe to more affordable mid-range platforms. The Information Center of the Technology & Business Systems Division was given the challenge of reducing its use of mainframe MIPS by designing and implementing an alternate platform solution. Historically, the role of the Information Center within the organization has been to provide customized development, analysis and fulfillment to both internal and external customers. The requests for information received are as varied as the data collected by Dun & Bradstreet, and in the course of fulfillment, probably every database and data source has been accessed at one time or another. Since its inception, the group has used the SAS System for its extensive data manipulation, analytical, and reporting capabilities.

THE EVALUATION PROCESS

The evaluation process covered not only traditional RDBMSs (Sybase® and Informix®), but also a new product from Red Brick Systems® (a product optimized for data warehousing, querying and decision support), and lastly, the SAS System. The SAS System was chosen for several reasons: (1) The time quoted by Sybase and Informix for the database load was anywhere between 3 days and 1 week. Benchmarks with SAS software indicated that both the download of the raw data and the building and indexing of the SAS data files could be completed within a ten-hour window. (2) No other product offered as comprehensive a set of integrated tools as the SAS System - tools not only for complex data manipulation, but for analysis and presentation as well. Ultimately, SAS software provides tools that we can give to our business users to empower them to meet more of their own information delivery needs. (3) The level of SAS software expertise within the group meant an easier transition to the Unix platform. Less time would be required for implementing the data warehouse and less time would be required to begin fulfillment on the new platform. (4) The portability of SAS code meant that many of the existing mainframe procedures could be ported to the Unix environment with little or no change. Since the SAS language is virtually platform-independent, our developers could be productive immediately, with very little retraining necessary.
THE IMPLEMENTATION

Once the decision had been made to use SAS software as the data management tool, one operating system constraint became a very important issue. In the Unix operating system, the size of any file is restricted to 2 GB. Some DBMS packages, like Sybase, manage their own database space and thus have no limitation on the size of a table. The SAS System, however, does not manage its own space, dictating that the database files had to comply with the 2 GB limit.

One obvious choice, and probably the simplest, for segmenting the data would have been to split the data sequentially by the DUNS Number. However, a different approach was adopted with the following strategy.

First, in an effort to minimize the size of the database, some data elements have been separated into smaller datasets, generically called "support" files. These support files contain data elements whose frequency of occurrence in the data is low relative to the amount of space they occupy. For example, the mailing address of a company occupies 65 bytes but is present in only 17% of the cases. By placing the mailing address fields in a separate data file, more than 1 GB of space is saved. In this manner, more than 10 separate support files have been created, saving 4+ GB in space.

Second, the remaining data elements, comprising a 600+ byte extract for over 19 million cases (12 GB), are segmented using five key business indicators which place each company into one of 13 different categories, called "main" segments. An example of a key business indicator is one which determines if a company is out of business or not. The advantage of our scheme is that certain categories of records can automatically be included or excluded, without even reading them, simply by knowing into which category they fall. For projects where we need to query the entire database, we frequently want to eliminate certain types of companies altogether (e.g., those which are out of business). In these circumstances, our database design allows us to avoid processing millions of records, using less valuable CPU resources and giving a faster turnaround.

Some segments still exceed the 2 GB limit and have to be further subdivided by the DUNS Number. The entire segmentation scheme is illustrated in Figure 1.

Indexing the Database

It has been decided to index the data sets only on the DUNS Number, for several reasons. First, the nature of the application being implemented is such that most of the projects involve taking a DUNS-numbered file of customer data and appending Dun & Bradstreet data using the DUNS Number as the primary key. For this type of operation, only an index on the DUNS Number is required. Second, it has been difficult to identify a reasonable number of fields on which to create secondary indexes. The decision-support type projects which required "sweeping" the entire database, looking for records matching certain selection criteria, have had insufficient commonality in terms of the data elements used for selection. Creating too many indexes would add significant overhead in terms of space and require additional time to build the database. This might not be offset by gains in turnaround time. Thus, it has been decided to defer the issue of creating additional indexes until further analysis can be done. Since space is not an issue, it has been decided not to compress the datasets, allowing for faster retrieval.

As described above, the "main" database is segmented by business indicators, comprising 13 categories broken into 21 data segments. Retrieving a record using the DUNS Number index could potentially require searching all 21, querying each until a particular DUNS Number is located - not a desirable solution. To resolve this issue, a more efficient two-level indexing scheme has been developed. The first level is a SAS data file, indexed by DUNS Number, containing all 19+ million cases and pointers (in the form of a code) to the main segments in which they are located. At the second level, the main segments themselves are indexed by the DUNS Number. This allows the retrieval of data for any DUNS Number in only two steps: (a) use the first-level index to determine which main segment contains each DUNS Number and then (b) follow the pointers to the appropriate segments, using the DUNS Number to directly access the correct record in each segment. This indexing scheme is illustrated in Figure 2.

DATA ACCESS METHODOLOGY

The complexity of the database design just described translates to complexity in the SAS code required to retrieve data. As an application developer or business user wishing to retrieve data, you would have to know, for each element you wanted to extract, whether it is in a main segment or a support file, and if it is in a support file, which one. "Main" data elements must be retrieved from multiple files since the data is segmented. To further complicate matters, since a given file cannot exceed 2 GB, data files need to be split into two as they reach the size limit. Developers would have to constantly monitor the state of the database and adjust their code as changes occurred - an undesirable situation and a nightmare for program maintenance.

To address this problem, metadata (that is, data about the data), in the form of two data structures, and a set of SAS macros have been created. This new data access methodology makes the number, type and location of the SAS data files, as well as the location of individual data elements, transparent to the application developer and the business user.
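The two-level indexing scheme can be sketched with a small Python model. This is only an illustration of the lookup logic, not the actual SAS implementation: the segment codes, DUNS Numbers, and field values below are hypothetical. The first-level file maps each DUNS Number to a segment code, and each main segment is itself keyed by DUNS Number, so any record is reachable in two keyed lookups instead of probing up to 21 segments.

```python
# Illustrative model of the two-level index (all data hypothetical).
# Level 1: one DUNS-indexed file mapping every case to a segment code.
# Level 2: each main segment is indexed by DUNS Number.

first_level = {            # DUNS Number -> main-segment code
    "04-111-1111": "A1",
    "04-222-2222": "B2",
    "04-333-3333": "M1",
}

main_segments = {          # segment code -> DUNS-indexed data file
    "A1": {"04-111-1111": {"name": "Acme Corp", "employees": 120}},
    "B2": {"04-222-2222": {"name": "Beta LLC", "employees": 15}},
    "M1": {"04-333-3333": {"name": "Mu Inc", "employees": 8}},
}

def lookup(duns):
    """Retrieve a record in two keyed reads: (a) the first-level index
    gives the segment code, (b) the segment's own DUNS index gives the record."""
    segment_code = first_level.get(duns)          # step (a)
    if segment_code is None:
        return None
    return main_segments[segment_code].get(duns)  # step (b)

record = lookup("04-222-2222")
```

With a naive design, a miss in one segment forces a probe of the next, up to 21 keyed reads per DUNS Number; here the cost is fixed at two.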
The first data structure is an ASCII file containing the name of each SAS data file in the database, the file system on which it resides, and a code indicating whether it is a main or support file. Any time a data file is added, deleted or moved, a change is made to this table. Certain naming conventions have also been adopted. File system names have to be 8 characters or less so that they can also be used as SAS librefs. The main files are named after the 13 categories, with a number added at the end to indicate how many sub-segments that category contains. For example, if the categories are given as A, B, ..., M, then the SAS data files might be named A1, B1, B2, B3, ..., M1, indicating that category A maps to one dataset, B to three, and M to one.
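A minimal sketch of how such a file-system map might be read and used follows. The file layout, column order, and names here are assumptions for illustration only, not the actual Dun & Bradstreet table: each line names a data file, the file system it lives on (also usable as a libref), and a main/support code, so a request for one category can find all of its sub-segment files.

```python
# Hypothetical file-system map: "datafile filesystem type" per line.
MAP_TEXT = """\
a1 fsmain01 main
b1 fsmain02 main
b2 fsmain03 main
b3 fsmain04 main
mail fssupp01 support
"""

def load_map(text):
    """Parse the ASCII map into {datafile: (filesystem, type)}."""
    entries = {}
    for line in text.splitlines():
        datafile, filesystem, ftype = line.split()
        entries[datafile] = (filesystem, ftype)
    return entries

def segments_for(entries, category):
    """All sub-segment files for one main category, e.g. 'b' -> b1, b2, b3."""
    return sorted(name for name, (_, ftype) in entries.items()
                  if ftype == "main" and name[0] == category)

entries = load_map(MAP_TEXT)
```

Because every program reads the map at run time, adding a sub-segment (say, a new b4 file) requires only a one-line change to the table, not a code change.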
The second data structure is a data dictionary. Created using PROC CONTENTS and some manually entered information, it is a SAS data file containing one record for each variable in the database with its name, type, length, and the segment from which it comes. Additional information, such as the mainframe source of the data element and a long text description of the variable, has been added to help users understand the contents of the database and for the creation of a hardcopy data dictionary for documentation.

A set of SAS macros utilizes these two data structures. The main macro, APND, defines the user's interface to the database. The user simply provides the names of the variables requested (up to a maximum of 100), the name of the input SAS data file containing the DUNS Numbers to which data will be appended, and the name of the SAS data file that will contain the output. The macros do all the work. The variables requested are matched to the data dictionary to determine in which datasets they reside. A PROC SQL step is constructed for each dataset, extracting the appropriate data elements. All the individual files are then combined into one data set and returned to the user. This data retrieval process is illustrated in Figure 3.

The design of this data access methodology embodies object-oriented concepts, even though SAS software is not traditionally considered an object-oriented language. These two data structures, plus the SAS macros which utilize them, encapsulate everything any program needs to know to extract data from the database. Programs become virtually maintenance-free.

A PREVIEW OF THE SAS INTEGRATED DATA DICTIONARY

SAS Institute has recognized the need for its internal, as well as user-written, applications to have a method of storing and retrieving metadata about structures within the SAS System. A post-Version 6 release of the SAS System will provide this functionality in the form of a data dictionary.

The Dun & Bradstreet application presented in this paper has a component which manages metadata and provides some of the features that will be available in the SAS data dictionary. The advantage of using a data dictionary provided by SAS Institute is that it will be fully integrated into the product. Some of the attributes of an integrated data dictionary are:

• The metadata set up for an application like the Dun & Bradstreet application could be recognized by other components of the product (PROC REPORT, for example).

• The validation rules would be enforced throughout the system. If a variable, for example, has a value constraint, then that constraint would be enforced by any application (PROC FSEDIT, PROC APPEND, etc.) which could update the file.

The SAS data dictionary will provide the ability to define metadata for libraries, SAS files and variables, as well as external files. The metadata can be information typically found in the SAS System, such as variable name, variable type and variable format, or it can be information that is not part of the Version 6 metadata. This includes metadata that will be provided by SAS Institute, such as:

• Prose descriptions of libraries, datasets, and variables
• Icon associations for variables
• User-written constraints on variable values
• Drill-down indicator
• Classification status

Metadata can also be extended to include information that is specific to an application. An example of this type of metadata is the segment a variable comes from, as defined in the Dun & Bradstreet application.

Another type of metadata that is important to most applications is information describing how different structures relate to each other. There are many relationships within the SAS System that will be described in the data dictionary, such as the link between a library and its SAS files or a dataset and its variables. There will also be relationships that allow for special behavior within the product, such as a relationship that can propagate changes to metadata among variables. For instance, if the same variable is used in several datasets and there is a desire to keep the format and the label the same for each occurrence of the variable, a relationship could be defined that would automatically propagate any changes to the format or label of one occurrence of the variable to all the others.

As with the other types of metadata described above, applications may extend the power of the SAS Institute data dictionary by defining their own relationships among the metadata. Complex interconnections among the structures in the SAS System can be defined by creating relationships that connect them.

The goal at SAS Institute is to make the data dictionary extensible enough that it can be useful for internal as well as user-written applications. The Dun & Bradstreet application uses several of the features that will be provided by the SAS data dictionary. The benefit it would receive by using the dictionary is that the functionality would be automated as well as integrated into the SAS System.
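The retrieval flow of the APND macro described above can be modeled with a short Python sketch. The variable names, dataset names, and dictionary contents below are hypothetical stand-ins, not the real application's layout: requested variables are matched against the data dictionary to find their home datasets, one extract is performed per dataset (the analogue of one PROC SQL step each), and the extracts are merged on the DUNS Number key.

```python
# Hypothetical data dictionary: variable -> dataset it lives in.
DICTIONARY = {"name": "main", "mailaddr": "mail", "sic1": "sics"}

# Hypothetical DUNS-keyed datasets standing in for the SAS data files.
DATASETS = {
    "main": {"04-111-1111": {"name": "Acme Corp"}},
    "mail": {"04-111-1111": {"mailaddr": "PO Box 7, Basking Ridge NJ"}},
    "sics": {"04-111-1111": {"sic1": "7372"}},
}

def apnd(requested_vars, input_duns):
    """Append requested variables to each input DUNS Number:
    one extract per home dataset, then a merge on the DUNS key."""
    # Group requested variables by the dataset that holds them.
    by_dataset = {}
    for var in requested_vars:
        by_dataset.setdefault(DICTIONARY[var], []).append(var)
    # One extract per dataset, merged onto the input keys.
    output = {duns: {} for duns in input_duns}
    for dataset, varlist in by_dataset.items():
        for duns in input_duns:
            record = DATASETS[dataset].get(duns, {})
            for var in varlist:
                output[duns][var] = record.get(var)
    return output

result = apnd(["name", "mailaddr", "sic1"], ["04-111-1111"])
```

The caller never names a data file: the dictionary decides where each variable lives, which is what makes the calling programs immune to database reorganizations.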
SUMMARY
The task of building an alternate platform solution is difficult and complex. The SAS System may not be an obvious choice for implementing a data warehouse, but our experience illustrates that it is certainly a viable alternative.
The SAS System provides traditional
database management features such as indexing, data
compression, support for SQL, and the creation of data
views. With SAS Macro Language, you have a powerful
facility for building the infrastructure of a data warehouse
and for building a sophisticated mechanism for
transparent data access.
ACKNOWLEDGMENTS
SAS is a registered trademark of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Brand and product names are registered trademarks or
trademarks of their respective companies.
[Figure 1: Data Segmentation - support files (primary key: DUNS Number; each has a different layout; contain only the records having that data, e.g., Mailing Address, SIC Codes, Public Records) and main segments (primary key: DUNS Number; each has the same layout; segmented by 5 indicators into Type 1 through Type 13; a main segment still too large is sub-segmented, e.g., Type 2 into Types 2a, 2b, 2c).]

[Figure 2: Indexing Scheme - at the first level, a single file indexed by DUNS Number holds a file pointer to the owning main segment; at the second level, each main segment (Type 1 through Type 13, including sub-segments such as Type 2a) is itself indexed by DUNS Number.]

[Figure 3: Data Appendage Flow - Step 1: the user requests data, e.g., %apnd(name, mailaddr, sic1); Step 2: the requested data elements are mapped via the data dictionary to their source files (e.g., name to main, mailaddr to mail, sic1 to sics); Step 3: the data files are mapped to database tables and locations via the file system map; Step 5: the output data set is created.]