AN INTELLIGENT INTERFACE TO THE SAS® SYSTEM FOR USE BY PRODUCT DEVELOPMENT ENGINEERS

Debra D. Spencer, Sandia National Laboratories

Introduction

An intelligent interface to the SAS System for a limited set of data analysis needs is being developed at Sandia National Laboratories for use in a VAX/VMS environment. An existing data retrieval and analysis system has recently been updated and made interactive, with the intent that non-statistician users (mostly engineers) become able to satisfy their simpler analysis needs themselves using the SAS System. Since early attempts to make the SAS System accessible to all users have not been very successful, an experimental, intelligent interface to the SAS System, tailored to the particular needs of this group of users, is being developed. The interface has three modules of knowledge: data manipulation, graphics, and routine statistics.

This paper describes the target users and their environment, the problems they have had using the SAS System, and the intelligent interface under construction.

Description of the User Environment

Sandia National Laboratories, a diverse engineering laboratory, is a non-profit, non-fee prime contractor for the Department of Energy. Sandia has many data needs, some of which are met by data stored in its Product Data System. The users of product data are generally either component design, reliability, or quality assurance engineers. Data engineers specify data transmission requirements from production facilities and ensure that all data analysis needs can be met. The data can be complex. Both scalar and time series data are available in the system, representing both development and production testing of many different types of products.

One task of the Product Data System Management and Development Division at Sandia is to develop and support a computer environment in which hundreds of laboratory engineers can retrieve data and perform routine analyses, including:

    Calculating descriptive statistics
    Graphing data, generally a plot or a histogram
    Printing all or part of the data
    Comparing two or more classes
    Performing very simple linear regression

For less routine analysis needs, another Sandia division provides statistical consulting. Since too few statisticians are available to do all the analyses that should be done, the need to provide electronic help with more routine analyses has become more and more obvious.

Data have been stored in this database for more than twenty years, creating a very large, complex, and varied database. Until recently, the only access to the data was by batch processing of card decks; these retrievals and routine analyses were run by clerks supported by a programming staff. The data and programs were then moved from the batch environment to VAX/VMS, with the SAS System and a few other commercial data analysis products available. The old batch programs were made interactive, and the system is gradually evolving to a more friendly environment. The major functions, in addition to SAS capabilities, are data retrieval, subsetting, display, contents information (e.g., how much data, what variables are available), and some analytical choices.

An important element of developing such an environment is the provision of a very friendly analysis tool. The SAS System was chosen as the analytical basis for this system due to its statistical power and because several potential users were familiar with SAS programming.

The first "friendliness" efforts were directed toward providing easy conversion to a SAS data set and an environment in which to create and run SAS procedures. Next, a library of canned SAS procedures was provided for the user to select, to modify using SAS macro variables and commented-out options, and to run.

However, only a handful of users have used the canned SAS programs which have been provided. One problem has been the lack of a full SAS Macro facility for VMS. In addition, most users have been frightened away the first time an error occurred; the SAS log file, offered as an option on the menu, has not been a satisfactory means for problem identification by the user. One recent improvement is the addition of VMS FMS screens for user completion of the "canned" SAS procedures.

Another, longer-term approach to the problem has been to investigate current artificial intelligence techniques and to begin development of an intelligent interface to the SAS System. Its goal is to provide basic product data analysis skills to the laboratory engineering community, leaving the harder analysis problems for the statistical consultants.

Problems Engineers Have Using the SAS System

Only a small percentage of non-statisticians have both the data analysis skills and the computer skills to easily use software such as the SAS System, since extensive training is required to be either a statistician or a computer programmer. From watching novice SAS users over the past two years, several problems that non-statisticians have, even in a "canned" environment, have been identified. These are:

    Organizing and cleaning the data
    Learning the SAS programming language, at least enough to identify problems
    Identifying the appropriate statistical technique
    Locating the chosen statistical procedure within the SAS System's terminology
    Understanding the statistical assumptions involved in the procedure
    Reading and interpreting the results

In the first of these areas, organizing and cleaning the data, an initial graphical look at the data should invariably be taken. This is typically translated into a request for a histogram. However, the user might be better served with an x-y plot, a series of side-by-side boxplots, or a bar chart with subgroups. Outliers should be identified and investigated for validity. Specific data collection methods can allow unique problems to creep into the data, such as tester malfunctions or multiple test attempts which may need to be combined in either a last-reading-recorded or final-attempt manner. The better the job of up-front data planning is done, the simpler this step in the process becomes.

Modifications to reorganize the data are often necessary since different questions require different organization of the dataset. The following two examples show typical steps which must be taken to arrange the data properly.

The data reorganizations or modifications shown in Figure 1 are:

    Using multiple SAS datasets for the analysis
    Creating classes from a continuous variable
    Properly specifying a date value
    Extracting a value from within another value
    Making a class variable more descriptive

    DATA;
      SET DATASET1 DATASET2;
      LENGTH TEMP $ 13;
      IF TSTDAT < '01JAN72'D THEN AGE = 'OLD';
      IF TSTDAT > '01JAN85'D THEN AGE = 'NEW';
      TEMP = SUBSTR(TSTCOD,2,1);
      IF TEMP = 'R' THEN TEMP = 'ROOM';
      IF TEMP = 'L' THEN TEMP = 'LOW';
      IF TEMP = 'H' THEN TEMP = 'HIGH';

FIGURE 1. Example Data Reorganization Program

Figure 2 involves extra steps requiring knowledge that the data must be sorted first and knowledge about how to create one observation from two observations in order to do a matched pairs t-test.

    PROC SORT DATA=DATASET5;
      BY SERIALNO TEMP AGE;   /* 'NEW' sorts before 'OLD' within each SERIALNO and TEMP */
    DATA;
      SET DATASET5;
      BY SERIALNO TEMP;
      RETAIN DIFF_AE;
      IF FIRST.TEMP AND AGE = 'OLD' THEN RETURN;   /* no NEW reading for this pair */
      IF LAST.TEMP AND AGE = 'NEW' THEN RETURN;    /* no OLD reading for this pair */
      IF FIRST.TEMP THEN DIFF_AE = AE;
      IF LAST.TEMP THEN DO;
        DIFF_AE = DIFF_AE - AE;
        OUTPUT;
      END;
      KEEP DIFF_AE SERIALNO AGE TEMP GROUP;

FIGURE 2. Example of Data Organization: Preparing For a Matched Pairs T-Test

Once users are satisfied with the data integrity and ready to proceed with the analysis, they can construct SAS programs or, perhaps, modify existing canned routines. Either requires, to some extent, learning to program in the SAS System.

Canned routines do not always work the first time. One example is a variable name erroneously defined to contain a nonstandard character such as a hyphen. For assistance, the user must read the SAS log unless programming has been supplied to catch this problem. Most novices do not know the log exists or do not understand it. At any rate, occasional, nonstatistician users do not retain enough SAS programming skills after taking the SAS Basics class to easily modify an erroneous SAS program, and even the best of systems will not be able to take all eventualities into consideration in a canned environment.

In conjunction with data organization, the user must choose the statistical strategy. This means that the user must decide which statistical technique is available to answer the question: regression, one-way analysis of variance, estimation, etc. In general, the nonstatistician user cannot be expected to do this except for a small, well-defined set of routine questions.

After choosing the statistical strategy, locating the routine within the SAS System is not always easy or obvious. For example, using PROC UNIVARIATE for a matched pairs t-test is not obvious. Choosing the correct regression or analysis of variance procedure can be confusing.

If the user successfully executes a SAS program, implicit assumptions, of which the user is unaware, may have been made. The most classic is the assumption of normal data. Is the user aware that some data are better described in terms of medians and interquartile ranges than in terms of means and standard deviations? Whether group variances are equal or how small the sample size is often needs to be considered. Will the regression user look at the residuals or just get an "answer"? Does the user know what to do if classical parametric techniques are not justified?

Can the user read SAS printouts, which were designed for statisticians who know both what to expect and the standard terminology? Can the user look at a PROC UNIVARIATE printout and determine whether the data is normal? What are SGN RANK and a BOXPLOT?
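One way to see why PROC UNIVARIATE can serve for a matched pairs t-test, and what the DATA step in Figure 2 prepares, is to write the calculation out: the test is a one-sample t-test that the mean of the per-unit differences is zero. The sketch below is a minimal Python illustration under stated assumptions, not part of the system described in this paper. The tuple layout, the function names, the sample values, and the NEW-minus-OLD sign convention are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_differences(records):
    """Collapse NEW and OLD readings into one difference per (serial, temp).

    `records` is an iterable of (serial, temp, age, value) tuples, a
    hypothetical stand-in for the observations in DATASET5. Units that
    lack either reading are dropped, which is what the RETURN statements
    in Figure 2 appear to do.
    """
    readings = {}
    for serial, temp, age, value in records:
        readings.setdefault((serial, temp), {})[age] = value
    return {
        key: by_age["NEW"] - by_age["OLD"]  # sign convention is an assumption
        for key, by_age in readings.items()
        if "NEW" in by_age and "OLD" in by_age
    }


def paired_t(differences):
    """Return the one-sample t statistic for H0: mean difference = 0, and its df."""
    d = list(differences)
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


# Made-up readings for illustration only.
records = [
    ("S1", "ROOM", "NEW", 10.0), ("S1", "ROOM", "OLD", 8.0),
    ("S2", "ROOM", "NEW", 11.0), ("S2", "ROOM", "OLD", 10.0),
    ("S3", "ROOM", "NEW", 12.0), ("S3", "ROOM", "OLD", 9.0),
    ("S4", "ROOM", "NEW", 5.0),  # no OLD reading, so no pair is formed
]
diffs = paired_differences(records)
t, df = paired_t(diffs.values())
print(diffs)
print(round(t, 3), df)
```

Once each unit is reduced to a single difference, any procedure that tests whether a single variable has mean zero answers the matched pairs question, which is why the reorganization step has to come first.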
What are KURTOSIS, CSS, and a STEM LEAF? If users can read the SAS printouts, can they reach a valid statistical interpretation of the results? Figure 3 displays the results of a PROC TTEST program. Can the user detect a significant difference between the two groups? Does the user know what question is being answered in the printout from PROC ANOVA in Figure 4?

    AGE    N      MEAN   STD DEV   STD ERR     MIN     MAX
    NEW   20   -58.435     1.728     0.387   -60.8   -52.8
    OLD   20   -60.070     1.441     0.322   -63.3   -56.9

    VARIANCES       T     DF   PROB>|T|
    UNEQUAL     3.249   36.8     0.0025
    EQUAL       3.249   38.0     0.0024

    FOR H0: VARIANCES ARE EQUAL, F = 1.44 WITH 19 AND 19 DF   PROB>F = 0.435

FIGURE 3. Can the User Interpret the Results?: PROC TTEST

    DEPENDENT VARIABLE: AF

    SOURCE            DF   SUM OF SQS   MEAN SQ   F VALUE     PR>F
    MODEL              1      8712.83   8712.83      0.99   0.3253
    ERROR             38    333350.94   8772.39
    CORRECTED TOTAL   39    342063.77

    R-SQ      C.V.   ROOT MSE   AF MEAN
    0.025  65.9324      93.66    142.06

    SOURCE   DF   ANOVA SS   F VALUE     PR>F
    AGE       1    8712.83      0.99   0.3253

FIGURE 4. Can the User Interpret the Results?: PROC ANOVA

Description of the Intelligent Interface

Since using the SAS System as the basis of the retrieval/analysis system without a sophisticated user interface attracted few nonstatistician users, the decision was made to build an intelligent interface to the SAS System which would answer about 90% of the questions currently asked by the data users. This decision was made after a lengthy study of both artificial intelligence techniques and tools and the analysis requests being made by users, and after a study of whether sufficient problems amenable to artificial intelligence techniques existed to justify developing an in-house specialty in artificial intelligence applications. This criterion was dictated by the shortage of AI professionals in the job market.

Several issues were identified as important in the development of an intelligent interface for the SAS System. These included:

    User friendliness
    Modularity of the pieces of the interface
    Flexibility of the interface to handle unexpected difficulties
    Expandability of the system to include additional knowledge of how to use the SAS System to do additional routine analysis
    Integration with existing software

The issue of integration with existing software became the driving force behind the eventual system design. The artificial intelligence component, besides knowing what analysis to do, how to do it, and how to interpret the results for the user, needed to interface with the existing software and with the SAS System. This translated for VAX/VMS into the need for an interactive environment in which to communicate among DCL (Digital Command Language) command procedures, the SAS System, and the language in which the artificial intelligence part of the system was coded.

The most natural language choices for this part of the system were either VAX C or Lisp, since artificial intelligence tools for the VAX/VMS environment tend to be coded in one of these. A large rule-based knowledge engineering tool written in C was chosen. Several problems have arisen with regard to the adaptability of the tool to the task at hand, but extra C coding, while time-consuming, is capable of solving most difficulties. For this reason, it is more appropriate to regard the intelligent interface as being programmed in C, instead of a shell. The benefit of the shell was that it provided a starting point in one of the standard languages for VAX/VMS.

The code currently incorporating the interface is partially written in the tool language. It also has many C routines with calls to the C Run Time Library, the VMS Run Time Library, VMS System Services, and (in the future) either Curses, the VAX C Screen Management Package, or FMS, the VMS Forms Management System, to develop a more visually attractive and user-friendly interface.

The intelligent interface is a menu option in the retrieval/analysis software. The interface can eventually be expanded to envelop the entire retrieval/analysis system. Once the (current) interface option is chosen, an identical environment is created in a subprocess which hibernates awaiting SAS programming chores. The subprocess performs each request that it receives and then returns control to the parent process. VMS Run Time Library routines are required to establish the subprocess environment and the links between the two processes.

The parent process executes a DCL command procedure which:

    Creates the subprocess
    Writes files, acceptable to the tool, containing information about the data set, using SAS procedures such as PROC CONTENTS
    Loads this as initial dynamic knowledge into the knowledge base system by utilizing the command line parameter feature of C
    Runs the C program which is the extension of the tool

The C program determines and constructs SAS procedures necessary for rearranging data, determining normality, and other operations, then returns control to the subprocess which is waiting for a job. The SAS program outputs the results in as friendly a way as possible, which is sometimes extremely friendly and sometimes not, depending upon the particular SAS procedure in use. The C program reactivates when control is passed back, reads the returned information, reasons about it, informs the user, and continues with the session.

Passing information between the processes is not always easy. Files must be carefully constructed. The SAS System does not always allow for direct transmission of results, which means C programs must be constructed to locate the information in the more typical SAS printout.

The interface is being constructed in independent modules which can call each other. The current modules being developed are:

    A graphics display module to aid the user with the visualization of the data
    A data manipulation module, since this step seems to be necessary most of the time
    A statistics module which helps the user to correctly do the simplest analyses of the data

The current status of the project is that the basic structure of the system has been determined, coded, and tested. The data manipulation module and the analysis module are partially coded, although much functionality remains to be added.

The Graphics Module

The graphics module will provide appropriate displays of the data to the user. The simplest option will be a listing of all or a subset of the data. A histogram will provide a look at a single variable in isolation. Outliers on histograms will be identified to the user and suggestions made. Relationships between variables can be seen through x-y plots, side-by-side boxplots, or grouped/subgrouped vertical bar charts, depending upon the continuity of the variables. PROC GREPLAY will be incorporated to allow users to arrange multiple graphs on a hardcopy page or to size graphs. Titles, footnotes, labels, and symbols can be specified by the user. Regression curves on graphs will be available.

Most of the user communication, once the system and user agree upon a particular type of graphic, will be done via forms.

The Data Manipulation Module

The data in this standardized data base is neither necessarily stored in the form needed for every eventual analysis nor converted into a SAS data set by a statistician with a clear understanding of the SAS System and statistics. Data manipulations (extra initial SAS DATA steps) are frequently necessary. For this reason, early in the development of the statistics module of the system, the need for a data manipulation module was identified and addressed.

To determine an appropriate knowledge representation for the expert knowledge about data manipulation in the SAS System, a technique called conceptual graphing [Sowa 84] was used to organize and understand the task needs before incorporation into the commercial knowledge base tool. Various objects and actions were identified which captured the nature of the task. The initial set of generic objects upon which various actions can be performed includes:

    The dataset itself
    Each observation
    Each variable
    Each slot for a data value (the intersection of observation and variable)

The initial set of actions which can be taken on the objects (not all of which make sense for each object) is:

    delete      rename      split
    change      create      select
    rearrange   transpose   replace
    sort        join

These objects and actions were coded in the appropriate structures of the knowledge base tool, and the supporting attributes and required functionality were identified from the conceptual graphs.

As an example, consider what change means for each of these objects. To change a dataset means to specify another dataset with which to work. This choice indicates a reinitialization of the intelligent interface, a restart. The required attributes are the name of the new SAS data set and a double-check before proceeding. The required functionality is redefinition of the appropriate global variables (one of which will recall the interface upon completion) and termination of the current program.

To change an observation means to replace all or some of the values for that observation. Since observations don't have names (observation numbers don't mean much in this context), observations must be identified; identification is made through serial number, test code, test date, and attempt number. Only as much of this information as necessary to uniquely identify the observation need be specified. Once the observation is identified, the appropriate values can be changed by filling in a form.
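The identification step described above, in which the user supplies only as many key fields as are needed to single out one observation, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the C and knowledge-base code of the interface. The field names, the dictionary layout, the function names, and the sample values are hypothetical.

```python
# Identifying fields named in the text; the spellings used here are assumptions.
KEY_FIELDS = ("serial", "test_code", "test_date", "attempt")


def identify(observations, **criteria):
    """Return the indexes of the observations matching every given key field."""
    unknown = set(criteria) - set(KEY_FIELDS)
    if unknown:
        raise KeyError(f"not an identifying field: {sorted(unknown)}")
    return [
        i for i, obs in enumerate(observations)
        if all(obs[field] == wanted for field, wanted in criteria.items())
    ]


def change_observation(observations, new_values, **criteria):
    """Return a new list in which the single matching observation is updated.

    Raises ValueError when the criteria match no observation or more than
    one, so the caller can ask the user for further identifying fields.
    """
    matches = identify(observations, **criteria)
    if len(matches) != 1:
        raise ValueError(f"{len(matches)} observations match; need exactly one")
    changed = list(observations)
    changed[matches[0]] = {**observations[matches[0]], **new_values}
    return changed


# Made-up observations for illustration only.
data = [
    {"serial": "A1", "test_code": "TR1", "test_date": "02JAN85", "attempt": 1, "AE": 5.0},
    {"serial": "A1", "test_code": "TR1", "test_date": "02JAN85", "attempt": 2, "AE": 6.0},
    {"serial": "B2", "test_code": "TR1", "test_date": "03JAN85", "attempt": 1, "AE": 7.0},
]
updated = change_observation(data, {"AE": 6.5}, serial="A1", attempt=2)
```

Here the serial number alone identifies unit B2, while unit A1 also needs the attempt number. The sketch returns a new list rather than editing in place, a design choice made for the illustration.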
The default for changing observations is by explicit specification of the values, but meaningful functional changes can also be supported.

To change a variable means either to change some of the attributes of the variable or to replace the values for the variable. Since variables have names, identification of which variable to change is straightforward. The default for changing values of a variable is by a function. Knowledge of certain SAS functions will be incorporated into the system.

To change a value means to replace the data value with another value or missing. This category is a generalization of both changing a variable and changing an observation. To identify the value, both the corresponding variable and observation must be identified as noted above. A value can be changed either explicitly (the default) or through a function. Multiple values, and hence variables and observations, can be changed. Complex specification of which values to change could involve, for example, a range of values for a variable (perhaps changing all the outlying values of variable X to missing) or all data values of -99, which might indicate malfunction of a particular tester. Screens or forms will be designed for facilitating the entry of complex information.

Changing observations, variables, or values all result in the creation of a SAS program which is executed to create a new SAS data set.

As a second example, deleting a data set means doing a VMS delete of the data set, after confirming the user's intentions. Only the name of the data set is required. Deleting an observation requires the same identification as above. Deleting a variable means knowing the name of the variable. Deleting a value is identical to changing the value to missing, which was handled above. Again, both simple and complex specification of which values, variables, or observations to delete is possible.

As a third example, renaming a dataset or a variable is straightforward; both the new and old names of either are required. Renaming a value or an observation doesn't make sense since neither of these has a name; therefore, the system suggests to the user that "variable" was intended rather than "value" or "observation".

The Elementary Statistics Module

The initial statistical strategy in the knowledge base is limited to the task of simple comparison of two or more things. The system will execute programs in the subprocess to answer questions such as: "Is the data for variable X normally distributed?" The recommended comparison analysis is performed in the subprocess and the results presented to the user for consideration.

One reason for choosing simple comparison of two or more things is that the logic for choosing the statistical procedure is better defined and agreed upon than for many statistical tasks. Flow charts of the general logic can be and have been drawn [Wall 86]. In addition, such comparisons are frequently requested by the users of the system.

There are four general types of data (that is, four basic scales of measurement for data), commonly referred to as ratio, interval, ordinal, and nominal. For the statistics module, the first two are lumped together into numeric; the other two are called rank (or ordered classification) and non-ordered classification, respectively. Knowing the correct measurement scale is vitally important in knowing what kind of analysis to do. The intelligent interface provides the initial data type labels by examining the output from SAS PROC CONTENTS.

The first step in the interface's quest for simple comparison of two or more things is the identification of the things for comparison. This, in terms of the structure of the SAS dataset and thus the SAS analysis program, can be more complicated than it sounds. For instance, in a before and after comparison, storage of the two comparison quantities in separate observations of the same variable name is quite different (with respect to SAS knowledge of how to do the comparison) than storage as two separate variables in the same observation.

In addition to determining what and how many to compare, the program must determine the measurement scale, and whether independent or dependent samples are being compared. The system will provide the user with descriptive statistics and call the graphics module to obtain introductory graphics. Perhaps a transformation might be suggested by either the system or the user. For numeric data, the system checks for normality and homogeneous (equal) variances, if appropriate, in order to help determine the appropriate analysis routine. The system might recommend that the user move to a lower measurement scale, such as ranked data, or try some other transformation.

Once such parameters are determined and reshuffled as appropriate, the system looks for rules to determine the appropriate analysis method. There are 11 different analysis recommendations that the system currently knows. Examples of these are:

    For two independent samples of numeric data where the tests for both equal variances and normality are passed, the system recommends a t-test for 2 independent variables and calculation of confidence limits on the difference.
    For more than 2 dependent samples of non-ordered classification measurement scale, the system recommends a chi-square test of independence in the contingency table.

If the system is unable to recommend one of these 11 choices, it informs the user that the problem is outside the scope of its knowledge and refers the user to a human statistical consultant, providing a name and phone number.

Others around the world are also addressing the problem of an artificially intelligent system to help the user with various kinds of statistical data analysis. My assessment of the efforts is one of limited, but increasing, levels of success. One reason is that the diagnostic protocols of the statistician are largely uncharted. A major benefit offered to the field of statistics by the new field of artificial intelligence is the challenge to analyze, improve, and better understand accepted statistical strategy. Several references are given in the bibliography to other work in this combination of the two disciplines of statistics and artificial intelligence.

Summary

In summary, the non-statistician community cannot realistically be expected, en masse, to use a sophisticated statistical analysis product such as the SAS System by themselves. One approach is to build an artificially intelligent interface to assist the non-statistician user. To do so requires intricate analysis of the needs of the group of users being addressed by such a system, as well as sophisticated software which takes much time to develop.

VAX and VMS are trademarks of Digital Equipment Corporation. SAS is a registered trademark of SAS Institute, Inc.

References

Chambers, J. M. (1981) "Some Thoughts on Expert Software." Proceedings of the 13th Symposium on the Interface, 36-40.

Chambers, J. M., Pregibon, D., and Zayas, E. R. (1981) "Expert Software for Data Analysis, An Initial Experiment." Proceedings of the 43rd Session of the ISI XLIX, Buenos Aires, 294-303.

Gale, W. A. (ed) (1986) Artificial Intelligence & Statistics. Addison-Wesley.

Hajek, P. and Ivanek, J. (1982) "Artificial Intelligence and Data Analysis." COMPSTAT 1982, Physica-Verlag, 54-60.

Hahn, G. J. (1985) "More Intelligent Statistical Software and Statistical Expert Systems: Future Directions." The American Statistician, 39, 1-16.

Hakong, L. and Hickman, F. R. (1985) "Expert Systems Techniques: An Application in Statistics." Polytechnic of the South Bank, London, 43-63.

Hand, D. J. (1984) "Statistical Expert Systems: Design." The Statistician, 33, 351-369.

Hand, D. J. (1985) "Statistical Expert Systems: Necessary Attributes." Journal of Applied Statistics, 12, 19-27.

Nelder, J. A. (1977) "Intelligent Programs, the Next Stage in Statistical Computing." Recent Developments in Statistics, J. R. Barra et al., ed., North-Holland, 79-86.

Oldford, R. W. and Peters, S. C. (1984) "Building a Statistical Knowledge Based System with Mini-Mycin." MIT Alfred P. Sloan School of Management, Technical Report No. 42.

Portier, K. M. and Lai, P. (1983) "A Statistical Expert System for Analysis Determination." Proceedings of the ASA Statistical Computing Section, 309-311.

Pratt, A., Marti, M., and Catot, J. M. (1985) "An Inference Between a Non-Expert User and Statistical Systems: The STATXPS." IACS, 45th ISI Session, Amsterdam, 367-368.

Pregibon, D. and Gale, W. A. (1984) "REX: an Expert System for Regression Analysis." Proceedings COMPSTAT 84, 242-248, Prague, Czechoslovakia.

Smith, A. M. R., Lee, L. S., and Hand, D. J. (1983) "Interactive User-friendly Interfaces to Statistical Packages." The Computer Journal, 26, 3, 199-205.

Sowa, J. F. (1984) Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley.

Wall, F. J. (1986) Statistical Data Analysis Handbook. McGraw-Hill.

D. D. Spencer
Sandia National Laboratories
P. O. Box 5800
Division 2825
Albuquerque, NM 87185
(505) 844-3847
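The rule lookup described for the Elementary Statistics Module can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the C and knowledge-base tool code that the paper describes. Only the two recommendations quoted in the text are encoded; the remaining nine are not given in the paper and are not guessed here, so every other case falls through to the referral. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

REFERRAL = "outside the known rules: refer the user to a statistical consultant"


@dataclass(frozen=True)
class Comparison:
    """Facts gathered before a rule is chosen (field names are assumptions)."""
    n_samples: int                          # how many things are being compared
    independent: bool                       # independent (True) or dependent samples
    scale: str                              # "numeric", "rank", or "non-ordered"
    normal: Optional[bool] = None           # normality check result, numeric data only
    equal_variances: Optional[bool] = None  # equal-variance check, numeric data only


def recommend(c):
    """Return an analysis recommendation, or the referral when no rule fires."""
    if (c.n_samples == 2 and c.independent and c.scale == "numeric"
            and c.normal is True and c.equal_variances is True):
        return "two-sample t-test with confidence limits on the difference"
    if c.n_samples > 2 and not c.independent and c.scale == "non-ordered":
        return "chi-square test of independence in the contingency table"
    return REFERRAL


# Two independent numeric samples that pass both checks.
print(recommend(Comparison(2, True, "numeric", normal=True, equal_variances=True)))
# The same comparison when the normality check fails: no stated rule applies.
print(recommend(Comparison(2, True, "numeric", normal=False, equal_variances=True)))
```

Keeping the fallback as the final return mirrors the behavior described in the text, where a problem outside the system's knowledge is handed to a human consultant rather than forced into the nearest rule.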