The SAS system and Meta-Knowledge
Gerard HATABIAN - Herve AUGENDRE
Abstract
As explained by Y. DODGE and D.J. HAND in a paper entitled "What
Should Future Statistical Software Look Like?", published in the November 1991
issue of the Statistical Software Newsletter, "Meta-data is information about data
and can obviously be used to help the user of statistical software. More exciting is
the prospect that the computer itself can access the meta-data to control the
analysis in some extent." Meta-knowledge comprises not only these meta-data but
also the ways to use them.
These ideas have already been explored by SAS Institute in JMP®. Attributes
are associated with each variable and the statistical platforms adapt themselves
to the attributes. The question is: why not in the SAS System?
The existing possibilities are listed and discussed, including those available
in SAS/INSIGHT™ and SAS/LAB®. The SAS announcements concerning the data
and SAS/EIS® are also taken into account. This is set against what is
available in other software, including Statistical Expert Systems.
An example
Before defining meta-knowledge, let us look at a small example to see
what it is. The classical "Class" data table contains five variables (name,
sex, age, weight and height) for the pupils of a classroom. Suppose you
first want to study the distribution of the variables and then look at the
relation between two variables with the SAS System.
To study the distribution, you use the CHART (or GCHART)
procedure to obtain a histogram. This works for sex and height. With age,
however, the six distinct values of the variable are ignored and
some strange midpoints are used instead. You have to use the DISCRETE
option to obtain a satisfactory result.
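The midpoint problem can be illustrated outside SAS. The following Python sketch is not the algorithm used by the CHART procedure; the ages, the number of bins and both function names are hypothetical, chosen only to show why a variable with six distinct values reads badly when it is binned as if it were continuous.

```python
from collections import Counter

# Hypothetical ages with six distinct integer values, as in the Class example.
ages = [11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 16]

def continuous_bars(values, n_bins):
    """Treat the variable as continuous: equal-width bins labelled by midpoint."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        # The maximum is placed in the last bin rather than in a bin of its own.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[lo + (index + 0.5) * width] += 1
    return dict(sorted(counts.items()))

def discrete_bars(values):
    """Treat the variable as discrete: one bar per distinct value."""
    return dict(sorted(Counter(values).items()))

print(continuous_bars(ages, 4))  # bars at 11.625, 12.875, 14.125, 15.375
print(discrete_bars(ages))       # bars at 11, 12, 13, 14, 15, 16
```

With four bins, the bars sit at midpoints that are not ages at all, and ages 11 and 12 are merged into one bar. Counting each distinct value separately, which is what the DISCRETE option asks for, gives one bar per age.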
How to look at the relation between two variables depends on ... the
variables themselves. You compute a frequency table (with proc FREQ) to
cross sex and age. You fit a line (with REG, or GLM, or even GPLOT) to
look at the effect of weight on height. One-way analysis of variance
(with TTEST, or with GLM using a different syntax) lets you look at the
effect of sex on weight. Finally, the effect of age on height needs some
logistic regression (with LOGISTIC) to be determined.
Two questions arise: how to make SAS choose the right options by
itself, and how to make it choose the right procedure (or method).
Solving the problem
This is one of the examples of JMP®, the SAS Institute product for
the Macintosh. JMP is especially efficient at solving the problem.
In JMP, variables are associated with a type or measurement level:
sex is nominal, age is ordinal, weight and height are interval.
[Figure: JMP data table for the Class data, 5 columns and 40 rows.]
You choose the analysis platform, "Distribution of Y" or "Fit Y by X",
and JMP automatically chooses the right method depending on the
measurement level of the variables.
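A choice of method driven by measurement levels can be sketched as a small lookup. This Python sketch is an illustration, not JMP's implementation: the function name is hypothetical, and the exact pairing of levels with methods is an assumption based on the usual reading of "Fit Y by X" (a categorical variable is one that is nominal or ordinal). Only the measurement levels of the four variables come from the text above.

```python
NOMINAL, ORDINAL, INTERVAL = "nominal", "ordinal", "interval"

# Measurement levels as given in the text for the Class variables.
LEVELS = {"sex": NOMINAL, "age": ORDINAL, "weight": INTERVAL, "height": INTERVAL}

def fit_y_by_x(y, x, levels=LEVELS):
    """Return the name of the method suited to response y and predictor x.

    Assumed mapping (not taken from the paper):
      categorical X, categorical Y -> contingency table
      categorical X, interval Y    -> one-way analysis of variance
      interval X, categorical Y    -> logistic regression
      interval X, interval Y       -> simple linear regression
    """
    y_categorical = levels[y] != INTERVAL
    x_categorical = levels[x] != INTERVAL
    if x_categorical and y_categorical:
        return "contingency table"
    if x_categorical:
        return "one-way analysis of variance"
    if y_categorical:
        return "logistic regression"
    return "simple linear regression"

print(fit_y_by_x("age", "sex"))        # contingency table
print(fit_y_by_x("height", "weight"))  # simple linear regression
print(fit_y_by_x("weight", "sex"))     # one-way analysis of variance
```

The point of the sketch is that the user names only the variables; the stored measurement level carries the rest of the decision.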
What is Statistical Meta-Knowledge
The knowledge we used to solve the problem is meta-knowledge. A
possible definition is "additional information needed to use the data and to
be able to transform it into statistical information". This knowledge can be
meta-data, strategies, or other knowledge.
Meta-data
Meta-data are data about data. They constitute information
describing the numerical data and its properties. As explicit and accessible
pieces of information, they provide guidance in choosing meaningful
statistical methods [see HAND].
Bounds on the values that a variable can take enable automatic
validation. Measurement scale (nominal, ordinal, interval, ratio)
determines the sort of analysis it is meaningful to apply. The knowledge of
the links between variables enables cross-validation and eases imputation;
this can extend as far as the relational data model (domains, integrity
constraints, ...) [see CHURCH].
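As a minimal sketch of how such meta-data could be held and used, the Python fragment below attaches bounds and a measurement scale to a variable. The class, the bounds for age and the list of summaries per scale are all hypothetical illustrations; the summaries follow the classical reading of the four scales, in which each scale also admits what is meaningful on the weaker ones.

```python
from dataclasses import dataclass
from typing import Optional

SCALE_ORDER = ["nominal", "ordinal", "interval", "ratio"]

# Illustrative summaries that first become meaningful at each scale.
SUMMARIES = {
    "nominal": ["mode"],
    "ordinal": ["median"],
    "interval": ["mean"],
    "ratio": ["coefficient of variation"],
}

@dataclass(frozen=True)
class VariableMeta:
    name: str
    scale: str                     # one of SCALE_ORDER
    lower: Optional[float] = None  # bounds are optional meta-data
    upper: Optional[float] = None

    def invalid_values(self, values):
        """Automatic validation: return the values outside the declared bounds."""
        return [
            v for v in values
            if (self.lower is not None and v < self.lower)
            or (self.upper is not None and v > self.upper)
        ]

def meaningful_summaries(meta):
    """List the summaries that are meaningful for the variable's scale."""
    rank = SCALE_ORDER.index(meta.scale)
    result = []
    for scale in SCALE_ORDER[: rank + 1]:
        result.extend(SUMMARIES[scale])
    return result

# Hypothetical declaration for the age variable of the Class table.
age = VariableMeta(name="age", scale="ordinal", lower=5, upper=20)
print(age.invalid_values([12, 13, 130, -1]))  # [130, -1]
print(meaningful_summaries(age))              # ['mode', 'median']
```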
But meta-data can also be the measurement procedures or the
research design. In fact, it is hard to give a limit to the scope of
information that might be relevant.
Strategies
"A statistical strategy is a formal description of the choices, actions
and decisions to be made whilst using statistical methods in the course of a
study" (HAND). In fact, strategies are two-level knowledge. They are
knowledge of the generic organisation of statistical processing, and they
are also knowledge of the specific statistical data processing tasks: that is
to say, for a specific method, knowledge of its conditions of use, its
assumptions, and its rules for interpretation.
For example, explicit conditions of use make it possible either to
verify them for the chosen method or to select a meaningful method. In the
same way, assumptions have to be verified and, when they are violated,
might suggest a transformation of the data.
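The second level of a strategy, knowledge attached to one method, can be sketched as a record holding a condition of use, an assumption check and a suggestion. Everything in this Python sketch is hypothetical: the record layout, the function names, and above all the symmetry check, which is a deliberately crude stand-in for a real assumption check and uses an arbitrary threshold.

```python
from dataclasses import dataclass
from statistics import mean, median, pstdev
from typing import Callable, Dict, List

@dataclass
class MethodKnowledge:
    name: str
    condition_of_use: Callable[[Dict[str, str]], bool]  # tested on meta-data
    assumption_holds: Callable[[List[float]], bool]     # tested on the data
    suggestion_if_violated: str

def roughly_symmetric(values):
    """Crude stand-in check: mean and median closer than 0.25 standard deviation."""
    spread = pstdev(values)
    return spread == 0 or abs(mean(values) - median(values)) < 0.25 * spread

regression = MethodKnowledge(
    name="simple linear regression",
    condition_of_use=lambda levels: levels["x"] == "interval"
    and levels["y"] == "interval",
    assumption_holds=roughly_symmetric,
    suggestion_if_violated="consider transforming the response, e.g. with a logarithm",
)

def advise(method, levels, response):
    """Check the condition of use first, then the assumption."""
    if not method.condition_of_use(levels):
        return f"{method.name} is not meaningful for these variables"
    if not method.assumption_holds(response):
        return f"{method.name}: assumption violated, {method.suggestion_if_violated}"
    return f"{method.name} can be applied"

both_interval = {"x": "interval", "y": "interval"}
print(advise(regression, both_interval, [1, 2, 3, 4, 5]))
print(advise(regression, both_interval, [1, 1, 1, 2, 100]))
print(advise(regression, {"x": "nominal", "y": "interval"}, [1, 2, 3, 4, 5]))
```

Keeping the condition of use separate from the assumption check mirrors the distinction made above: the first can be decided from meta-data alone, while the second needs the data.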
Of course, as explained by HAND, issues of robustness to departures
from the assumptions and of experience gained from practice arise here. It
is putting these kinds of things into statistical expert systems which poses
the real challenge.
Other meta-knowledge
Explanation ability (of the results, of the reasoning) is an important
feature which relies mainly on textual meta-knowledge such as on-line
documentation. The ability to interpret graphical output of many shapes and
forms is also an important feature, one which relies on more technical and
less explored knowledge.
Then, knowledge of how to use all these pieces of knowledge is
required. A model of the knowledge and an "engine" able to deal with this
model are also needed. The engine can be an Expert System (Knowledge Based
System), an AF application or another program. But to be usable, the
system needs some knowledge of the statistical software itself (syntax,
input/output).
Existing Meta-Knowledge in the SAS System
The SAS System does contain some meta-knowledge. It is examined
below in the "classical" SAS System and in the "new" products
SAS/ASSIST®, SAS/INSIGHT™, SAS/LAB® and JMP on the Macintosh.
Meta-data in the SAS System
The SAS System includes the usual database meta-data. That is to
say, it handles data tables described by a name, a label and a type (DATA,
CORR, ...). This type is under-used, and the label is not used.
A table contains variables and observations. Each variable has a
name, a label and a type (NUMERIC or CHARACTER). It may also have a
format, ... Note that the variable type was introduced only
for computational reasons.
Thus, SAS contains a very limited set of metadata in which table
type is the only real meta-data.
Meta-data in SAS/INSIGHT, JMP and SAS/LAB
In SAS/INSIGHT, the "Set Properties" function, available in the
Data Window, enables us to define a variable's default role (Weight,
Frequency, Label or None), its measurement level (Nominal or Interval)
and of course its name and label. Role and measurement level are real
novelties in SAS, but it seems that they are not stored.
These notions were already there in JMP, the Macintosh product
from SAS. JMP is richer, with two extra roles, X and Y, for independent and
dependent variables, and an extra measurement level, Ordinal. Ratio is
missing, but it is harder to take into account. These meta-data are stored
with the table and guide the analysis.
In SAS/LAB, detection of the measurement level (Nominal or
Interval) for numerical variables is said to be automatic. A role is also
attributed; in fact, a nominal variable is taken to be an independent variable.
These meta-data are stored in a specific structure (the journal), which is
not a table, and guide the analysis.
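The text does not say how the automatic detection works. Purely as an illustration of what such a detection could look like, the Python sketch below guesses Nominal when a numeric variable takes only a few distinct whole-number values. Both the rule and the threshold of 10 are hypothetical and should not be read as a description of SAS/LAB.

```python
def guess_level(values, max_categories=10):
    """Guess 'nominal' or 'interval' for a numeric variable.

    Hypothetical heuristic: few distinct values, all whole numbers,
    suggests codes or categories rather than measurements.
    """
    distinct = set(values)
    all_whole = all(float(v).is_integer() for v in distinct)
    if all_whole and len(distinct) <= max_categories:
        return "nominal"
    return "interval"

print(guess_level([11, 12, 12, 13, 14, 15, 16]))  # nominal
print(guess_level([59.5, 61.0, 55.3, 66.1]))      # interval
```

A heuristic of this kind can be wrong in both directions, which is one reason for storing the level as explicit meta-data that the user can correct.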
So, SAS has tried to introduce some meta-data, but up to now this
has not been achieved in the SAS System itself.
Strategies in the SAS System
The SAS system does not contain real strategies as we defined them.
However, it does contain some interesting "behaviour". Thus, the existence
of default options balances the clumsiness of the language. Missing data are
excluded from analysis. These things are certainly key features of the
SAS System.
However, the same procedure may deal with numerous issues and
the same issue may be dealt with by numerous procedures. The default
options are not enough. But, in its cautious position, SAS Institute offers
just a few pieces of advice in the "Introduction to ..." chapters of the
SAS/STAT® User's Guide. The rest of the documentation is, alas, useless in
terms of strategy. The on-line help is also useless for this purpose.
Strategies in SAS/ASSIST, JMP and SAS/LAB
As for meta-data, SAS Institute added some kind of strategies in its
new products.
SAS/ASSIST includes knowledge of the SAS syntax but nothing
more. The user must choose the meaningful method and know how to read
the SAS outputs and interpret them.
In JMP, automatic choice of the meaningful method has been
introduced. So when one wants to "Fit Y by X", JMP selects automatically
between contingency table, one-way ANOVA, simple linear regression or
logistic regression depending on the measurement levels of the X and Y
variables. Furthermore, several graphical tools have been added to help
interpretation. But the clarity of the first version has been overshadowed
by the incoherent introduction of new methods and options.
In SAS/LAB (see TOBIAS), deep knowledge of a few methods has
been introduced. In the regression and ANOVA domain, SAS/LAB checks
assumptions, suggests transformations of the data and provides tools to help
interpretation.
So, there is some hope to find more real strategies in the SAS
System in the near future.
Explanation ability
The SAS System does not provide explanation. It computes
everything it can and lets users extract important pieces of information
and interpret them. JMP and SAS/INSIGHT do not give more explanation,
but they have adapted computation and graphical aids.
In contrast, SAS/LAB provides natural language explanation
such as: "There is a strong statistical evidence that an increase in WEIGHT
is associated with an increase in the expected value of HEIGHT". It also
includes graphical summarization and gives suggestions for what to do
next. So the user is guided, even though he still has to know the statistical
and technical terms.
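How a numerical result might be turned into such a sentence can be sketched in a few lines. This Python sketch is not how SAS/LAB works: the function name, the data and the cut-off on the t statistic are hypothetical, and the wording is only modelled on the sentence quoted above. It fits a least-squares line and reports on the slope.

```python
from math import sqrt

def explain(x_name, y_name, xs, ys, strong_t=4.0):
    """Fit y = a + b*x by least squares and describe the slope in words.

    strong_t is a hypothetical cut-off on |t| for the slope; a real system
    would use a p-value and a documented significance level.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_error = sqrt(sse / (n - 2) / sxx)
    if std_error > 0:
        t = slope / std_error
    else:
        # Perfect fit: treat any non-zero slope as overwhelming evidence.
        t = float("inf") if slope != 0 else 0.0
    if abs(t) < strong_t:
        return (f"There is no strong statistical evidence that {x_name} "
                f"is associated with {y_name}.")
    direction = "an increase" if slope > 0 else "a decrease"
    return (f"There is strong statistical evidence that an increase in {x_name} "
            f"is associated with {direction} in the expected value of {y_name}.")

# Hypothetical measurements, for illustration only.
weights = [60, 70, 80, 90, 100, 110]
heights = [50, 54, 55, 59, 62, 64]
print(explain("WEIGHT", "HEIGHT", weights, heights))
```

The statistical computation is the easy part; the meta-knowledge lies in the rule that maps the result to a sentence the user can act on.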
What has been added to the SAS System by users
Some users tried to use or even to enhance the meta-knowledge in
the SAS System.
Systematic use of the available meta-data in SAS (using proc
CONTENTS and FORMAT) has been studied by PODGURSKI. The new
announcements made by SAS during SEUGI'91 will provide new means: a
data dictionary (SASHELPSQL) and a meta-data manager included in
SAS/EIS®. These solutions are to be tested when possible. The meta-data
manager might be an excellent solution if it is available everywhere in the
SAS System.
Statistical Expert Systems (S.E.S.) using the SAS System
Statistical Expert Systems (S.E.S.) using the SAS System have been
developed by many users (see HATABIAN). Listed below are four
SES which use SAS as their statistical package:
• WAMASTEX (cf DORDA)
Developed in Austria for the statistical needs of clinical statisticians,
it includes the strategy for a study.
• DEXPERT (cf LORENZEN)
Developed at General Motors, DEXPERT is specialized in the design
of experiments. It helps to create a design and then to process it.
• DAEDALUS (cf DARIUS)
Developed in Belgium, it includes the strategy for a study. It is
oriented towards ANOVA models. The most interesting fact is that
DAEDALUS is written entirely in SAS. It uses TAXSY, an inference engine
written in SAS/AF's SCL.
• ADELE/ESIA (cf AUGENDRE)
Developed in France for survey analysis, it includes a model of
meta-data and strategies.
Note that the first two systems are said to be in use.
Existing Meta-Knowledge in other statistical software
What about meta-knowledge in the statistical software of the PC
and UNIX worlds? A small selection of four representative packages
follows.
In the PC world
Statgraphics® (Uniware) offers an interface very similar to
SAS/ASSIST. Concerning meta-knowledge, there is nothing more than in
SAS. However, it includes a very simple data dictionary.
SPAD® (CISIA - France) is a package specialised in multivariate
analysis. It contains a data dictionary used for validation and coding of
data. An explicit measurement level is required to perform an analysis, and
some groupings of methods are described in the documentation.
In the UNIX world
RS/EXPLORE® and RS/DISCOVER® (BBN Software) [LANE] might
be seen as an interesting alternative to SAS/LAB. With them, you have a
menu-driven interface, with highlighting to indicate recommended
choices; an on-line glossary for statistical and technical terms; help
messages that describe not only how to use menu choices but also what to
look for in the output from those choices; interpretation of graphs, tables
of statistics, and other types of output; recommendations on the proper
analysis method; and a diary to organise a record of the analysis.
S (Bell Labs) is a pure UNIX product. This statistical language is very
interactive (functions) and contains an object-oriented data model. In
S everything is data, even the programs themselves. S is the basis of many
SES, notably the best known, REX (see GALE).
What should be added to the SAS System by SAS Institute
Much knowledge could be added to the data to help analysts in their
day-to-day work. Some already exists in the SAS System and more has been
announced, but upgrades still need to be made.
First, the data structure has to be modified or an open data dictionary
has to be provided. Are the SQL data dictionary and the EIS meta-data first
steps in this direction?
Then, the existing procedures have to be enhanced to use this new
information. Incidentally, more comprehensive outputs would be
welcome. The new OUTPUT procedure is a tool but, once again, it will rely
entirely on the user's decisions.
Significant work is also needed on the on-line documentation.
The version 6 help facility is really not satisfactory. Help that is both
more context-sensitive and more general is needed.
Finally, the most important point is to ensure the consistency of the
different products, especially of SAS/LAB, SAS/INSIGHT and
SAS/STAT with base SAS.
References
AUGENDRE, H., et al., 1991, "How to achieve SES: Proposals for more general systems",
Preliminary Papers of the Third International Workshop on Artificial Intelligence
and Statistics, Fort Lauderdale, FL
CHURCH, Lewis, 1991, "Starting SAS Data Sets for Use with the SQL Procedure", SUGI'91
DARIUS, P. L., 1990, "A knowledge-based environment for the statistical management of
experimental data", Proceedings of the 9th COMPSTAT, Dubrovnik, YU
de FEBER, E., et al., 1991, "Task Analysis and Domains in Statistical Data Processing",
Deliverable 1.1 of the DOSES Project D41 Modelling Meta-Data
DORDA, W., et al., 1990, "WAMASTEX - Heuristic guidance for statistical analysis",
Short communication of the 9th COMPSTAT, Dubrovnik, YU
HATABIAN, G., et al., 1991, "Expert system as a tool for statistic: a review", Applied
Stochastic Models and Data Analysis, volume 7, number 2, June 1992, Wiley
HAND, David J., 1991, "Measurement Scales as Metadata", Preliminary Papers of the
Third International Workshop on Artificial Intelligence and Statistics, Fort
Lauderdale, FL
HAND, David J., 1992, "AI in Statistics", Proceedings of the Workshop New Techniques
and Technologies for Statistics, Bonn, February 1992
LANE, Thomas, 1991, "Computer guidance and interpretation in statistical analysis",
Preliminary Papers of the Third International Workshop on Artificial Intelligence
and Statistics, Fort Lauderdale, FL
LORENZEN, Thomas J., et al., 1991, "DEXPERT -- An expert system for the design of
experiment", Preliminary Papers of the Third International Workshop on Artificial
Intelligence and Statistics, Fort Lauderdale, FL
PODGURSKI, John, 1989, "A meta-database implementation strategy for documentation
and retrieval using the SAS System", SUGI'89
TOBIAS, Dr. Randall D., 1991, "Guided Data Analysis Using SAS/LAB Software",
SEUGI'91
Herve AUGENDRE & Gerard HATABIAN
EDF - Etudes et Recherches
1, Avenue du General de Gaulle
F-92141 Clamart Cedex
Email: [email protected]
SAS software, SAS/AF, SAS/ASSIST, SAS/EIS, SAS/LAB, SAS/STAT and JMP are registered
trademarks and SAS/INSIGHT is a trademark of SAS Institute Inc., Cary, NC, USA.
SPAD.N is a registered trademark of CISIA, FRANCE
Statgraphics is a registered trademark of STSC Inc., USA
RS/Explore and RS/Discover are registered trademarks of BBN Software Products
Corporation, USA