Getting connected with your DATA: Using SAS/CONNECT® and SAS/ACCESS® to work with data housed in a remote environment

Kevin Delaney, New York State Office of Mental Health, Albany, New York

Abstract

This paper will provide an overview of SAS/CONNECT and SAS/ACCESS software for accessing and manipulating databases located in a remote environment. Using the example of an Oracle database on a remotely located Unix server, the author will demonstrate many of the main features of SAS/CONNECT and SAS/ACCESS. SAS/CONNECT topics to be covered include connecting to the remote server, submitting SAS code remotely, and moving data back and forth between the server and client. SAS/ACCESS topics to be covered include interfacing with the data using the LIBNAME statement with SAS/ACCESS-specific options, and running a query against the data using the SQL Pass-Through facility. Issues of efficiency and practicality will also be discussed.

Introduction

At this company, many of our large data sets are housed in an ORACLE relational database on a Unix server. In order to access these data from our local area network and work with them in SAS, we had to become familiar with the SAS/CONNECT and SAS/ACCESS software packages.

Like most SAS® software, these packages offer many different ways to reach the same goal. Once you become familiar with several of these methods, the only challenge is to figure out which method is most appropriate for a given circumstance. This paper will attempt to walk through several of the more common utilities of SAS/CONNECT and SAS/ACCESS software, and hopefully clarify which methods are most efficient and most practical.

Throughout this paper I will stick with what I know: I will use examples involving a SAS/CONNECT session with a Unix server, and the SAS/ACCESS interface to ORACLE relational databases. For those of you who know more about other operating systems, or other database management systems, I hope you will find that my examples are adaptable to your host/entity of choice.

SAS/CONNECT

The introduction to the SAS/CONNECT User's Guide tells us that SAS/CONNECT is a "SAS-to-SAS client/server toolkit." What exactly does this mean? SAS/CONNECT software can be used to connect to a SAS session running on a remote server, to transfer data between environments, and to process data on the remote server. I will attempt to address the multitude of methods that SAS/CONNECT provides for accomplishing these tasks within this paper. SAS/CONNECT also supports SCL commands and SAS/AF applications that allow for remote messaging, linking of objects on different platforms, and running of scheduled applications for routine updates from one server to another, but I will not cover these topics here.

Getting Connected

There are several methods within SAS/CONNECT that can be used to actually connect to a remote environment. In the windowing environment you can use the SIGNON window to connect to the remote host. Select RUN from the toolbar, then SIGNON.

Figure 1: Select Run and SIGNON from the Display Manager pull-down menus.

This gets you the SIGNON window shown in Figure 2.

Figure 2: SAS/CONNECT SIGNON window

For my example I am using a script called tcpunix2.scr to connect to the remote host Unixdata, using the TCP/IP access method. These are the only three fields that need to be filled in; as the NOTE at the bottom of the window states, leaving a field blank will default to the current setting. The only other item you might want to change is whether or not remotely submitted commands execute synchronously, but we will discuss this more fully in a minute.

SAS/CONNECT ships with a number of script files that establish the connection between SAS on the local host and SAS on the remote host. These are specific to the remote host, but can be modified from their standard form.
You can also write your own script file; instructions for doing so are included in the SAS/CONNECT User's Guide, Version 8. By default these script files are stored in the !SASROOT\CONNECT\SASLINK folder in Windows. SAS will also look for script files in your SASUSER folder. An example of the default TCPUNIX.scr that ships with SAS is attached to this paper, as well as my modified TCPUNIX2.scr; see if you can recognize the changes. As you might guess, you can have a lot of fun with the script files, if you are so inclined.

SAS/CONNECT supports several different access methods that are operating-system dependent. All of my examples will involve the TCP/IP access method for communication between Unix and Windows. I will not say a whole lot about access methods other than to mention that you need to use one that is supported by both the local and remote hosts. For a more in-depth discussion see Communications Access Methods for SAS/CONNECT and SAS/SHARE Software, Version 8.

With this information, and the name of the remote host to which you would like to connect, you can then sign on to your remote SAS session using the SAS/CONNECT SIGNON window pictured in Figure 2. You would place the name of the script file you would like to use in the first field, the remote session's name in the second, and your communications access method in the third.

If you prefer a more programmatic approach when signing on to the remote host, the syntax is equally easy to grasp. In SAS Version 8, you need only associate the fileref RLINK with your script file and then issue the SIGNON command. For my example:

    filename rlink 'tcpunix2.scr';
    signon unixdata;

Passing SAS statements to the remote host

Now that we are signed on to SAS "up" on Unix, let's send some SAS statements through and see how it works. To send SAS statements to a remote host you need only bracket your normal SAS code with an RSUBMIT; - ENDRSUBMIT; block.
For instance:

    rsubmit;
       libname myunix '/home/myunixdir';
    endrsubmit;

creates the library MYUNIX within the SAS session executing on Unix, and then returns the log from this remote session to your local SAS log. (If we had done something that produced output, the output would also be directed down to the local output window.)

Remote Library Services

SAS/CONNECT also offers the ability to create a local library that refers to files on the remote session using the REMOTE engine. This is useful if you wish to use the Explorer window to look at the SAS data sets housed in your remote directories. The syntax to create a LOCAL libref to the same directory as our MYUNIX library "up there" would be:

    libname mylocux '/home/myunixdir' server=unixdata;

Once you have set up this remote libref you can then manipulate data on the remote host without wrapping your code in an RSUBMIT; - ENDRSUBMIT; block. For example:

    proc contents data=mylocux.set1 out=mylocux.set1contents;
    run;

If you happen to know the directory you have been assigned on the remote host this works well, but what about viewing the WORK directory? You can use the SASHELP.VMEMBER data set view on your remote host to set up a local libref to your remote WORK library:

    rsubmit;
       data findwork;
          x=1;
       run;
       data find2(keep=path);
          set sashelp.vmember;
          if upcase(memname)='FINDWORK';
       run;
       proc download data=find2 out=finduxwork;
       run;
    endrsubmit;

    data _null_;
       set finduxwork;
       call symput("workdir",trim(path));
    run;
    %put &workdir; *to make sure it worked;

    libname unixwork "&workdir" server=uxdata2;

Notice we are looking for the Unix WORK library, so we need to SET SASHELP.VMEMBER from Unix by placing that DATA step inside an RSUBMIT block. For those of you who have not used the VMEMBER data set view in the past, it contains the attributes of all the data sets currently referenced in your SAS session. By creating a data set in the WORK library and then selecting the variable PATH for that data set, we obtain the full path of our current WORK library.

This example also adds a new SAS/CONNECT procedure. PROC DOWNLOAD, and its partner in crime PROC UPLOAD, are SAS/CONNECT procedures that perform data transfer. The syntax for the procedures really is as easy as it looks. For PROC DOWNLOAD, DATA=data-set-name refers to the data on the remote host which you wish to push to your local SAS session, and OUT=data-set-name is the name of the data set that will reside in the local session. In this case the procedure copies the data set FIND2 from Unix down into the data set FINDUXWORK on our local SAS session. This data set is then used to create the macro variable WORKDIR, and a remote library reference to WORKDIR is established.

This seems like a lot of work, but it actually executes in tenths or hundredths of seconds, and then allows you to use the local Explorer window to look at data sets on the remote server, rename them interactively, and even move them to other referenced libraries on either host. You can use the remote library reference as you would any other library reference, so you can SET data on the remote host and use it to create a local data set, you can use PROC PRINT to print data from the remote host, and, well, you get the point. However, this is often not the most efficient way to use the SAS/CONNECT product. For example, let's look at the following code:

HEAT # 1

    data work.test;
       set unixwork.smallset;
    run;

VS.

    rsubmit;
       proc download data=work.smallset out=work.test2;
       run;
    endrsubmit;

HEAT # 2

    data unixwork.test;
       set localref.smallset;
    run;

VS.

    rsubmit;
       proc upload data=localref.smallset out=work.test2;
       run;
    endrsubmit;

HEAT # 3

    proc format library=work cntlout=unixwork.fmts;
    run;
    rsubmit;
       proc format library=work cntlin=work.fmts;
       run;
    endrsubmit;

VS.

    proc format library=work cntlout=work.fmts;
    run;
    rsubmit;
       proc upload data=work.fmts out=work.fmts2;
       run;
       proc format cntlin=fmts2;
       run;
    endrsubmit;

I am not sure where the word HEAT comes from, but definition 10a in my dictionary does state "One round of many in a sporting competition, such as a race." This example pits remote library services against PROC DOWNLOAD/UPLOAD in a little contest to see who is faster. With relatively small numbers of observations, and particularly with small numbers of variables, these two methods come pretty close. However, PROC DOWNLOAD/UPLOAD definitely wins both HEAT # 1 and HEAT # 2. The advantage of using these procedures over the remote library option grows wider as you add more variables and observations to the data sets you are moving between hosts. Of course, if you are cleaning up for the night and interactively moving data from your Unix WORK directory to a permanent library, it might be easier to click and drag in the Explorer window, but for long programs that need to be duplicable and/or completely automated, PROC DOWNLOAD/UPLOAD seems to make more sense.

HEAT # 3 is much closer, because there is an extra step needed to use PROC UPLOAD to move the data. Also, unless you have a HUGE format catalog, I don't know that the FMTS data set will ever be big enough to see a real difference in efficiency. What would be neat (this is directed to those SAS people who make this stuff happen) is if

    options fmtsearch = (work.formats unixwork.formats library.formats);

actually worked. Unfortunately, as it stands now, if you point the FMTSEARCH= option at a format catalog in the UNIXWORK library or any other remote library reference, you won't get an error, but when you try to assign a format from the remote FORMATS catalog to a variable in your local session it won't work.
This is because "You cannot open a catalog through a server because access to catalogs is not supported when the user machine and server machine have different data representations." (If you want to see this NOTE yourself, double-click on the FORMATS catalog as it appears in the UNIXWORK folder of the Explorer window.)

Are we having fun yet? The best attributes of SAS/CONNECT software are still ahead of us. Not only can SAS/CONNECT talk back and forth with a remote host, but it can also do so asynchronously. To this point we have not made use of the WAIT=NO option in any of our RSUBMIT statements. This option tells SAS to send the SAS statements in the RSUBMIT; - ENDRSUBMIT; block through to the remote host, but to immediately return control to the local SAS session. We haven't used this option thus far because we haven't needed it; all of the code we have submitted executed and returned results faster than we could blink. This would not be true if we were trying to pull records out of a database with a couple million records, or to perform an SQL query that combines ten tables from a relational database.

In my mind the best reason for using SAS/CONNECT is to be able to send large, memory-intensive tasks such as these to another server and let the processing take place on the remote host, leaving you free to do other things locally. This is especially true if you store your data remotely so as not to bog down your local server. We will look more closely at the uses of WAIT=NO, and the other SAS/CONNECT statements that work with it, as we turn our attention to another important piece of SAS software.

SAS/ACCESS software

If your data are stored in a format other than a SAS data set on the remote server you are CONNECTed to, how do you ACCESS them? In effect, SAS/ACCESS software provides a SAS-to-non-SAS database management software connection in the same way that SAS/CONNECT is a SAS-to-SAS connection.
SAS/ACCESS allows you to read in and modify data housed in a non-SAS data storage package, and then write that modified data back out to the database. From the data analyst's perspective, I don't have a need to write data back out to the database; in fact, in my job I don't have the privilege of doing so. My focus will therefore be on the various ways to "access" data stored in a relational database using SAS/ACCESS, rather than on the way to write these data back out (PROC DBLOAD). Again, the examples in this paper discuss accessing ORACLE tables on a Unix server. If you are using a different DBMS, see the SAS/ACCESS User's Guide specific to your product for modifications that you might need to run these examples on your system.

SAS/ACCESS software provides three main methods for accessing a relational database: the ACCESS procedure, a DBMS-specific LIBNAME statement, or the SQL Pass-Through facility. I will compare and contrast the three.

The ACCESS Procedure

This procedure is the most code-intensive method of accessing a DBMS (those of you deathly afraid of SQL will note that I didn't say "of using DBMS data"), although none of the code is particularly difficult to grasp. The ACCESS procedure for relational databases consists of two distinct components, the ACCESS descriptor and the VIEW descriptor. The ACCESS descriptor is a set of statements that tells SAS how to access a DBMS table. For example:

    proc access dbms=oracle;
       create work.mytest.access;
       user=kpd;
       orapw=mypassword;
       table=category_service;
       path='prda';
       assign=yes;
       rename catsrv_code=CATCODE
              catsrv_label=Service;
       list all;
    run;

This is the access descriptor for an ORACLE table called CATEGORY_SERVICE within the ORACLE instance 'prda'. The access descriptor contains the information SAS/ACCESS will need to read this table when it is called upon to do so, including my userid (USER=) and password (ORAPW=).
ASSIGN=YES tells SAS that all attributes of data sets created from this ACCESS descriptor must conform to what is described here. For example, I have renamed the ORACLE field catsrv_code to be CATCODE. Any SAS data sets created using this descriptor will contain the variable CATCODE, and I will not be able to rename it in the VIEW descriptor. In addition to RENAME you can also use such familiar SAS options as FORMAT and DROP within the ACCESS descriptor.

A VIEW descriptor uses the information contained in its reference ACCESS descriptor to access the database, then CREATE VIEW to "take a picture" of the data. When you create a view, you actually set up a query of the data, which can later be called by any SAS procedure or DATA step. You can also create a SAS data set from the ACCESS procedure, but it must occur after the initialization of a VIEW descriptor. In other words, while we would like:

    rsubmit;
       proc access dbms=oracle;
          create work.mytest.access;
          user=kpd;
          orapw=noturpassword;
          table=category_service;
          path='prda';
          assign=yes;
          rename catsrv_code=CATCODE
                 catsrv_label=Service;
          list all;
          create work.myview.view out=outputdataset;
          select catsrv_code catsrv_label;
          subset where catsrv_code ='96';
       run;
    endrsubmit;

we instead need to use a second PROC ACCESS statement to create the data set:

    rsubmit;
       proc access dbms=oracle;
          create work.mytest.access;
          user=coevkpd;
          orapw=urnosey;
          table=category_service;
          path='prda';
          assign=yes;
          rename catsrv_code=CATCODE
                 catsrv_label=Service;
          list all;
          create work.myview.view;
          select catsrv_code catsrv_label;
          subset where catsrv_code ='96';
       run;
       proc access viewdesc=work.myview out=oratable1;
       run;
    endrsubmit;

Notice that I submitted this code to my SAS session running remotely. This is, even in the case of a data set with only two variables and one observation, the most efficient way of using PROC ACCESS. There are two reasons for this. First, the Unix server is far less bogged down with everyday traffic than my Windows server.
Even if I had a copy of this ORACLE database available locally, SAS could probably read it faster "up there." Second, since I don't actually have a copy of the data to access locally, it is much faster to access and manipulate it up where it lives than to pull the data through my network connection to Unix (which is what would happen if I submitted the code without the RSUBMIT).

The LIBNAME statement

The next option available to me is to reference the ORACLE instance where my data is stored using a LIBNAME statement.

    libname dwh1 oracle user=kpd password=stopasking path='dwh1' schema=cpeom;

    libname dwh1 oracle dbprompt=yes schema=cpeom;

    rsubmit;
       libname dwh2 oracle user=kpd password=iwonttell path='dwh1' schema=cpeom;
    endrsubmit;

The first piece of code represents a local library reference to the remotely stored ORACLE data. The second demonstrates the DBPROMPT= option discussed below. The main reason I can think of to set up the local libref is the same as the reason we used the SERVER= option earlier: it provides a way to look at and move the smaller data tables interactively. The third example shows the preferred method, remotely submitting the library reference to create the ORACLE library as close to the data as possible. As with remotely submitting PROC ACCESS in the previous section, we are trying to avoid pulling data through the network until absolutely necessary, i.e., until we have a small enough subset of our data to use PROC DOWNLOAD or Remote Library Services.

In case you are wondering, the SERVER= option presented in the SAS/CONNECT portion of the paper applies to the REMOTE library engine, while ORACLE in your LIBNAME statement here calls the ORACLE library engine, so we can't combine the two to get a local copy of a remote ORACLE library. Nice thought though.

The LIBNAME statement with options for the SAS/ACCESS to ORACLE engine has two distinct advantages over PROC ACCESS. First, by referencing the ORACLE instance (an instance is ORACLE's way of saying LIBRARY) you set up a reference to an entire group of tables at once, rather than having to create a descriptor for each table. Second, by using the DBPROMPT= option you can tell SAS to prompt you for your username, password and path rather than leaving them lying out in open code. (Note: this obviously will not work in batch SAS code, nor will it work for a remotely submitted library reference, since you won't have access to the resulting prompt locally.)

SQL Pass-Through Facility

For those of you familiar with SQL, the code for PROC ACCESS probably looked familiar. That is because SQL queries underlie most of what SAS does with SAS/ACCESS for relational databases. (SAS/ACCESS software for other types of database management systems that do not use SQL to operate on the data stored within them works differently.) If you do not use SQL, don't know how to use SQL, and have no interest in learning SQL, then the SQL Pass-Through facility is not for you. You can do pretty much everything you want to do with your DBMS data using PROC ACCESS or the LIBNAME statement, and never have to write any "real SQL code." But if you are going to be working with data with large numbers of observations, or many (50, 100, 250, etc.) related tables, you might want to start playing with SQL.

Here is an example of what looks to be a complicated SQL query (it's really not that bad, but I am not teaching SQL today so you will have to take my word for it) that combines information from three different tables in a relational database with over 1 million total records.
It produces a count of the total number of clients served per year by county:

    rsubmit wait=no;
       proc sql;
          connect to oracle (user=kpd orapw=mylipsrsealed path='pwh1' schema=snp);
          create table querytable as
             select * from connection to oracle
                (select dates.year, counties.ctyofres,
                        count(distinct services.recipient) as tot_served
                   from snp.dates dates, snp.services services, snp.counties counties
                  where dates.datekey=services.datekey
                    and counties.countykey=services.countykey
                  group by dates.year, counties.ctyofres);
          disconnect from oracle;
       quit;
    endrsubmit;

    *rdisplay unixdata;   /*Pick one of us not both*/
    *rget unixdata;

There are several key points. First, to toot SQL's horn a little, notice that it did not require sorting the database to perform BY-group processing, that it produced a frequency count for me, and that it also essentially produced a report data set of total clients served by county and year.

Second, what you may not have noticed is probably the most exciting part of this SQL code: the CONNECT TO ORACLE and SELECT * (SQL for ALL) FROM CONNECTION TO ORACLE statements. These statements are used to leave SAS entirely and run this query from within the ORACLE database itself. SAS is then passed the results of this ORACLE SQL query, which it uses to make the data set QUERYTABLE. This is by far the most efficient way of running a query against a database this large and complicated. It lets ORACLE do the work it was designed to do, and then lets SAS do the rest. This could have been submitted on Unix using a library reference for ORACLE such as DWH2 from my LIBNAME example, but this would have been slower than the query that uses the SQL Pass-Through facility. The query could also have been run using the local libref DWH1, but this would have been by far the slowest option (in the case of queries against HUGE data sets, the slowest by HOURS). Also, since this was submitted remotely with WAIT=NO, you can run other SAS procedures locally while this is running in your remote SAS session.

The last two lines of code bring us back to SAS/CONNECT. The RDISPLAY and RGET commands are used with the WAIT=NO option to go up to the remote server at a later point in time and pull down the SAS log and the output printed to the LISTING output destination. RGET puts these results into your local LOG and OUTPUT windows respectively, while RDISPLAY opens two more windows to display this output separately. Of these two, I prefer RGET. The reason for this preference is that you can use RGET with PROC PRINTTO to save a local copy of the remote SAS session's log and output, separate from your local SAS session log:

    proc printto log='remote.log' print='remote.lst' new;
    run;
    rget unixdata;
    proc printto; run;

I haven't figured out a good way to do this with RDISPLAY output, other than to interactively copy the LOG or OUTPUT and then paste it into some other text file for later.

Conclusion

This paper was intended to present just some of the many ways to use SAS/CONNECT and SAS/ACCESS software, and, within the ways presented, to describe their pros and cons. Hopefully the suggestions CONNECTed with you, and they will serve to make these two valuable packages more ACCESSible to you.

References

SAS Institute Inc. (1999), Communications Access Methods for SAS/CONNECT and SAS/SHARE Software, Version 8, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999), SAS/CONNECT User's Guide, Version 8, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999), SAS OnlineDoc, Version 8, Cary, NC: SAS Institute Inc.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
Contact Information

Please send questions, comments and suggestions to:

Kevin Delaney
NYS Office of Mental Health
44 Holland Ave
Albany, NY 12229
(518) 473-7868
[email protected]
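Sketch: the pieces end to end

The statements covered in this paper can be strung together roughly as follows. This is only a minimal sketch under stated assumptions, not code from a production job: the script file and session name reuse the ones from the earlier examples, while the ORACLE user, password, path, schema, table and column names (myuser, mypassword, mypath, myschema.mytable, grp) are hypothetical placeholders. The WAITFOR statement is not discussed above; it is a SAS/CONNECT statement that pauses the local session until an asynchronous RSUBMIT finishes, and the sketch assumes your release supports it.

```sas
/* Sign on to the remote SAS session (script and session name as in the paper). */
filename rlink 'tcpunix2.scr';
signon unixdata;

/* Send the heavy query to the remote host and return control immediately. */
/* All ORACLE names below are hypothetical placeholders.                    */
rsubmit wait=no;
   proc sql;
      connect to oracle (user=myuser orapw=mypassword path='mypath');
      create table work.querytable as
         select * from connection to oracle
            (select grp, count(*) as n
               from myschema.mytable
              group by grp);
      disconnect from oracle;
   quit;
endrsubmit;

/* ... other local work can run here while the remote query executes ... */

/* Pause until the remote task completes, then pull back its log and output. */
waitfor unixdata;
rget unixdata;

/* Download only the small summarized result, then end the remote session. */
rsubmit;
   proc download data=work.querytable out=work.querytable;
   run;
endrsubmit;

signoff unixdata;
```

The design choice here mirrors the paper's advice: let ORACLE do the aggregation, keep the intermediate table in the remote WORK library, and move data across the network only once it has been reduced to a small result.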