Host Systems
Building a Data Warehouse with SAS® Software
in the Unix® Environment
Karen Grippo, Dun & Bradstreet, Basking Ridge, NJ
John Chen, Dun & Bradstreet, Basking Ridge, NJ
ABSTRACT

A common trend in corporations today is "downsizing" - moving applications from expensive mainframes to more affordable mid-range platforms. The Information Center of Dun & Bradstreet Information Services, faced with the challenge of moving fulfillment off the mainframe, has chosen SAS software not only as the vehicle for its data analysis and delivery but also as its database management system.

Twenty gigabytes, extracted from mainframe legacy databases, have been loaded onto an HP-UX® platform as indexed SAS data files. Two data structures (a table mapping the data files to file systems and a data dictionary mapping data elements to data files) and a set of SAS macros have been developed to give the Information Center developers and business users transparent access to the data, freeing them from the need to know data file or data element locations.

THE CHALLENGE

Like many corporations today, Dun & Bradstreet Information Services is trying to reduce costs by moving applications from the mainframe to more affordable mid-range platforms. The Information Center of the Technology & Business Services Division was given the challenge of reducing its use of mainframe MIPS by designing and implementing an alternate platform solution. Historically, the role of the Information Center within the organization has been to provide customized development, analysis and fulfillment to both internal and external customers. The requests for information received are as varied as the data collected by Dun & Bradstreet and, in the course of fulfillment, probably every database and data source has been accessed by the group at one time or another. Since its inception, the InfoCenter has used the SAS System for its extensive data manipulation, analytical, and reporting capabilities.

THE EVALUATION PROCESS

Due to the diversity of projects and data requirements, some time and analysis was required to identify applications suitable for migration. Once the applications and the supporting data elements were chosen, a database management system had to be selected. Several key requirements were identified: (1) The time required to load the database had to be minimized. Since commitments to customers would be at stake, a solution requiring more than one weekend to load the data would not be acceptable, and less than twelve hours was preferable, in the event that the data had to be reloaded overnight. (2) The majority of projects involved appending Dun & Bradstreet data to files with less than 50,000 records using the DUNS Number® as the primary key for retrieving data (the DUNS Number is a unique number, like the Social Security number, that is assigned to every company in the database). (3) The time required to implement the database and transition the group from the mainframe to the Unix platform should be as short as possible.

The evaluation process covered not only traditional RDBMS's (Sybase® and Informix®), but also a new product from Red Brick Systems® (a product optimized for data warehousing, querying and decision support), and lastly, the SAS System. The SAS System was chosen for several reasons: (1) The time quoted by Sybase and Informix for the database load was between 3 and 7 days. Benchmarks with SAS software indicated that both the download of the raw data and the building and indexing of the SAS data files could be completed within a ten-hour window. (2) No other product offered as comprehensive a set of integrated tools as the SAS System - tools not only for complex data manipulation, but for analysis and presentation as well. Ultimately, SAS software provides tools that we can give to our business users to empower them to meet more of their own information delivery needs. (3) The level of SAS software expertise within the group meant an easier transition to the Unix platform. Less time would be required for implementing the data warehouse and less time would be required to begin fulfillment on the new platform. (4) The portability of SAS code meant that many of the existing mainframe procedures could be ported to the Unix environment with little or no change. Since the SAS language is virtually platform-independent, our developers could be productive immediately, with very little retraining necessary. (5) SAS is optimized to read SAS data files, as opposed to tables from other DBMS's. Since we were going to use SAS to present our data, regardless of the database we chose, storing our data in SAS data files enhanced our performance.
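The dominant workload described in requirement (2) - appending Dun & Bradstreet data elements to a customer file keyed by the DUNS Number - can be sketched in a few lines of SAS. All dataset, libref, and variable names here (custfile, main.a1, sales, employees) are illustrative assumptions, not the production names.

```sas
/* Append warehouse data elements to a DUNS-numbered customer
   file, using the DUNS Number as the primary key.  Names are
   illustrative only. */
proc sql;
  create table appended as
  select c.*, m.sales, m.employees
  from custfile as c
       left join main.a1 as m
       on c.duns = m.duns;
quit;
```

Because the warehouse data files are indexed on the DUNS Number, PROC SQL can often satisfy a join like this with keyed lookups rather than a full scan of the multi-million-record table.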
THE IMPLEMENTATION

Once the decision had been made to use SAS software as the data management tool, one operating system constraint became a very important issue. In the Unix operating system, the size of any file is restricted to 2 GB. Some DBMS packages, like Sybase, manage their own database space and thus have no limitation on the size of a table. The SAS System, however, does not manage its own space, dictating that the database files had to comply with the 2 GB limit.

One obvious choice, and probably the simplest, for segmenting the data would have been to split the data sequentially by the DUNS Number. However, a different approach was adopted with the following strategy.

First, in an effort to minimize the size of the database, some data elements have been separated into smaller datasets, generically called "support" files. These support files contain data elements whose frequency of occurrence in the data is low relative to the amount of space they occupy. For example, the mailing address of a company occupies 65 bytes but is present in only 17% of the cases. By placing the mailing address fields in a separate data file, more than 1 GB of space is saved. In this manner, more than 10 separate support files have been created, saving 4+ GB in space.

Second, the remaining data elements, comprising a 600+ byte extract for over 19 million cases (12 GB), are segmented using five key business indicators which place each record into one of 13 different categories, called "main" segments. An example of a key business indicator is one which determines whether a company is out of business or not. The advantage of our scheme is that certain categories of records can automatically be included or excluded, without even reading them, simply by knowing into which category they fall. For projects where we need to query the entire database, we frequently want to eliminate certain types of companies altogether (e.g., those which are out of business). In these circumstances, our database design allows us to avoid processing millions of records, using less valuable CPU resources and giving a faster turnaround.

Some segments still exceed the 2 GB limit and have to be further subdivided by the DUNS Number. The entire segmentation scheme is illustrated in Figure 1.

Indexing the Database

It has been decided to index the datasets only on the DUNS Number for several reasons. First, the nature of the applications being implemented is such that most of the projects involve taking a DUNS-numbered file of customer data and appending Dun & Bradstreet data using the DUNS Number as the primary key. For this type of operation, only an index on the DUNS Number is required. Second, it has been difficult to identify a reasonable number of fields on which to create secondary indexes. The decision-support type projects which require "sweeping" the entire database, looking for records matching certain selection criteria, have had insufficient commonality in terms of data elements used for selection. Creating too many indexes would add significant overhead in terms of space and require additional time to build the database. This might not be offset by gains in turnaround time. Thus, it has been decided to defer the issue of creating additional indexes until further analysis can be done. Since space is not an issue, it has been decided not to compress the datasets, allowing for faster retrieval.

As described above, the "main" database is segmented by business indicators, comprising 13 categories broken into 21 data segments. Retrieving a record using the DUNS Number index could potentially require searching all 21, querying each until a particular DUNS Number is located - not a desirable solution. To resolve this issue, a more efficient two-level indexing scheme has been developed. The first level is a SAS data file, indexed by DUNS Number, containing all 19+ million cases and pointers (in the form of a code) to the main segments in which they are located. At the second level, the main segments themselves are indexed by the DUNS Number. This allows the retrieval of data for any DUNS Number in only two steps: (a) use the first-level index to determine which main segment contains each DUNS Number and then (b) follow the pointers to the appropriate segments, using the DUNS Number to directly access the correct record in each segment. This indexing scheme is illustrated in Figure 2.

DATA ACCESS METHODOLOGY

The complexity of the database design just described translates to complexity in the SAS code required to retrieve data. As an application developer or business user wishing to retrieve data, you would have to know, for each element you wanted to extract, whether it is in a main segment or a support file, and if it is in a support file, which one. "Main" data elements must be retrieved from multiple files since the data is segmented. To further complicate matters, since a given file cannot exceed 2 GB, data files need to be split into two as they reach the size limit. Developers would have to constantly monitor the state of the database and adjust their code as changes occurred - an undesirable situation and a nightmare for program maintenance.

To address this problem, metadata (that is, data about the data), in the form of two data structures, and a set of SAS macros have been created. This new data access methodology makes the number, type and location of the SAS data files, as well as the location of individual data elements, transparent to the application developer and the business user.

The first data structure, called the File System Map, is an ASCII file containing the name of each SAS data file in the database, the file system on which it resides, and a code indicating whether it is a main or support file. Any time a data file is added, deleted or moved, a change is made to this table. Certain naming conventions have also been adopted. File system names have to be 8 characters or less so that they can also be used as SAS librefs. The main files are named after the 13 categories, with a number added at the end to indicate how many sub-segments that category contains. For example, if the categories are given as A, B, ..., M, then the SAS data files might be named A1, B1, B2, B3, ..., M1, indicating that category A maps to one dataset, B to three, M to one, and so on.
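A File System Map of this kind might be read and turned into librefs along the following lines. The file layout, path convention, and names here are assumptions for illustration; the paper does not specify the actual format.

```sas
/* Read the File System Map: SAS data file name, the file system
   it resides on, and a main/support code (M or S).  The
   three-column layout is an assumed, illustrative format. */
data fsmap;
  infile '/warehouse/fsmap.txt';
  input dsname $ fsname $ type $;
run;

/* One libref per file system; file system names are kept to 8
   characters so they can double as librefs. */
proc sort data=fsmap(keep=fsname) out=filesys nodupkey;
  by fsname;
run;

data _null_;
  set filesys;
  /* Assumes each file system is mounted at /<fsname>. */
  rc = libname(fsname, "/" !! trim(fsname));
run;
```

With the librefs assigned from the map rather than hard-coded, a data file can be moved between file systems by editing one line of the map, with no program changes.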
The second data structure is a data dictionary. Created using PROC CONTENTS and some manually entered information, it is a SAS data file containing one record for each variable in the database with its name, type, length, and the segment from which it comes. Additional information, such as the mainframe source of the data element and a long text description of the variable, has been added to help users understand the contents of the database and for the creation of a hardcopy data dictionary for documentation.

A set of SAS macros utilizes these two data structures. The main macro, APND, defines the user's interface to the database. The user simply provides the names of the variables requested (up to a maximum of 100), the name of the input SAS data file containing the DUNS Numbers to which data will be appended, and the name of the SAS data file that will contain the output. The macros do all the work. The variables requested are matched to the data dictionary to determine in which datasets they reside. A PROC SQL is constructed for each dataset, extracting the appropriate data elements. All the individual files are then combined into one data set and returned to the user. This data retrieval process is illustrated in Figure 3.

The design of this data access methodology embodies object-oriented concepts, even though SAS software is not traditionally considered an object-oriented language. These two data structures, plus the SAS macros which utilize them, encapsulate everything any program needs to know to extract data from the database. Programs become virtually maintenance-free.

RESULTS

The benefits we have experienced from implementing our data warehouse have been substantial:

(1) Development time and effort significantly reduced. In the past, a developer had to code one COBOL program per database accessed, typically four or five. Additionally, a SAS program had to be written to combine the extracts and produce a customized file or report. With the data warehouse, extracting data is as simple as a function call. Development time is significantly reduced. Most projects can be completed in just one step, combining data extraction, data manipulation, and presentation of the final result.

(2) Turnaround time significantly reduced. Mainframe processing windows have been diminishing steadily over the past several years. Access to certain databases is severely restricted, sometimes eliminated, during the day. Virtually any job requires an overnight run. If an error is made in JCL or code, another day could be lost. The alternate platform is our dedicated resource. A job on the platform typically runs in under 30 minutes and can be run at any time, with no queues.

(3) Greater customer satisfaction. The two factors listed above both translate to increased customer satisfaction. Projects which would have taken two to three days on the mainframe are usually completed within one day, allowing tight deadlines to be met, sometimes saving contracts and revenue.

(4) Greater data consistency. By pulling data from a variety of sources, and then cleansing and combining it, we have been able to provide a comprehensive source of data that has a high degree of consistency and quality. This type of resource has great potential to be leveraged throughout the organization.

(5) End-user empowerment. The database interface allows easy access to the database, freeing all users from the need to know the details of the database implementation. This ease of access makes it possible to allow non-technical users to utilize the data warehouse. Efforts are currently underway to build an application that will give users point-and-click data retrieval capabilities. An analytical workstation has also been built which allows business analysts to view data and drill down using both summarized and detail data extracted from the warehouse.

SUMMARY

The task of building an alternate platform solution is difficult and complex. The SAS System may not be an obvious choice for implementing a data warehouse, but our experience illustrates that it is certainly a viable alternative. The SAS System provides traditional database management features such as indexing, data compression, support for SQL, and the creation of data views. With the SAS Macro Language, you have a powerful facility for building the infrastructure of a data warehouse and for building a sophisticated mechanism for transparent data access.
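As a concrete illustration, the two-step retrieval behind the two-level indexing scheme might be coded along these lines. This is a hedged sketch, not the production APND macro: the dataset, segment, and variable names (input_duns, level1, main.a1, sales, employees) are assumptions.

```sas
/* Step 1: look up each input DUNS Number in the first-level
   index file to learn which main segment holds its record. */
proc sql;
  create table located as
  select i.duns, x.segment
  from input_duns as i, level1 as x
  where i.duns = x.duns;
quit;

/* Step 2: pull the requested variables from each main segment,
   relying on the segment's own DUNS index for direct access.
   One macro call per segment; results are then concatenated. */
%macro getseg(seg);
  proc sql;
    create table out_&seg as
    select l.duns, s.sales, s.employees
    from located as l, main.&seg as s
    where l.duns = s.duns and l.segment = "&seg";
  quit;
%mend;
%getseg(a1)
%getseg(b1)

data appended;
  set out_a1 out_b1;
run;
```

In the real methodology, the list of segments and the variables to select are not hard-coded as they are here but are derived from the File System Map and the data dictionary, which is what keeps the calling programs maintenance-free.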
[Figure 1: Data Segmentation]

[Figure 2: Indexing Scheme]

[Figure 3: Data Retrieval Process]