Data Warehouses and Bayesian Analysis - A match made by SAS
Ramanan Gopalan, Data Infoworks, Inc., Sunnyvale, CA
ABSTRACT
Data warehouses have the advantage of
storing and retrieving large amounts of data
in a coherent and efficient manner.
Bayesian analysis uses large amounts of
data either from outside sources or
previous analyses. SAS is probably the
only software environment that allows
seamlessly connecting data management
through warehouses and Bayesian
analysis. In this paper, I will demonstrate
how data warehouses can be used as a
tool in tracking prior data and posterior
estimates in a Bayesian data analysis
model for linear regression. SAS/AF will be
the front-end tool with PROC IML
performing the Bayesian Regression model.
Key words: prior, posterior, linear regression
INTRODUCTION
Data warehouses make information
available for fulfilling decision support
needs throughout an enterprise.
Consequently, building the data warehouse
is a business process that involves
technology to the extent that it can facilitate
achieving the enterprise requirement. The
true benefits of a data warehouse are
realized only when it is exploited using
statistical tools.
One of the most commonly used tools is the
decision support system (DSS). This
usually consists mostly of summary
tables and slices of the data to aid in
immediate decision making on a daily or
frequent basis. For instance, a car dealer
may want to know the number of cars of a
particular model sold in the last month.
The DSS will contain a report of the number of
cars sold broken down by model. In many
data warehouse applications, users may be
satisfied with just receiving such reports
from a DSS. Usually no further data
analysis is performed.
Recently, there has been a lot of interest in
data mining - analysis techniques used to
uncover patterns in the data that were not
anticipated or were hidden from view.
Neural networks and CART are two of the
techniques widely used in data mining
software such as the SAS Enterprise Miner.
Time series analysis and other trending
techniques are also used to reveal any
dependence on time factors.
Bayesian analysis of data allows the users
to incorporate their subjective beliefs into
the statistical analysis. These beliefs can
either be based on previous data
(data-based priors) or opinions about the
value of the parameter being evaluated
(judgemental priors). In either case, the
priors will have to be evaluated and
formulated before any further analysis
based on them can be done.
In this conceptual paper, I will explain in
detail Bayesian analysis, prior and
posterior distributions, and how the SAS
system can be used to evaluate priors and
present posterior analysis in a data
warehousing environment.
Statistical Inference
The main purpose for building a statistical
model for given data is to infer findings to a
more general population from which the
data is a sample. All the information from
given data can be summarized by the
likelihood function. There are two schools
of thought on how an inference can be
made. The classical or frequentist school
assumes that the model parameters are
fixed but unknown and the data given is
only one among many that can be obtained
from the model. Conclusions are made
using p-values and significance levels. The
p-value is the probability, computed under
the hypothesized model, of obtaining data at
least as extreme as those observed if the
study were repeated under identical
conditions. The smaller the p-value, the less
compatible the data are with the
hypothesized model.
Bayesian inference, on the other hand,
assumes that the model parameters are
random and unknown, i.e., only
probabilities can be assigned to values of
the unknown parameters. Conclusions are
based on the posterior distributions of the
parameters, which contain all the
information about the parameters after
observing the data. This approach is
appealing to many practitioners because it
does not assume repeated sampling and
allows the user to incorporate their own
beliefs into the inference.
Prior information is used sometimes in
frequentist inferences. One is supposed to
accept or reject a null hypothesis not only
based on the p-values obtained but also
any knowledge a person may have about
the inferential problem at hand. This is why
in many instances, statistical significance
does not lead to practical significance and
vice versa.
The Bayesian approach allows a user to
formally incorporate prior information in the
form of a distribution. After combining the
prior with the data, we obtain the posterior
distribution, which is in effect a weighted
compromise between the prior and the
model likelihood.
Probabilities derived from the posterior
distribution (posterior probabilities) are used
to make judgements about values of a
parameter given the data.
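The "weighted average" description can be made concrete with a standard textbook case that is not taken from this paper. Assume a single mean $\theta$ with prior $N(\mu_0, \tau^2)$, and a sample mean $\bar{y}$ of $n$ observations with known variance $\sigma^2$; the symbols $\mu_0$, $\tau^2$ and $w$ are introduced here only for illustration. Under these assumptions the posterior mean is a precision-weighted average of the prior mean and the sample mean:

```latex
\[
E(\theta \mid \bar{y}) = w\,\mu_0 + (1 - w)\,\bar{y},
\qquad
w = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2},
\qquad
\operatorname{Var}(\theta \mid \bar{y}) = \frac{1}{1/\tau^2 + n/\sigma^2}.
\]
```

As $n$ grows, $w$ shrinks toward 0 and the data dominate; a very diffuse prior (large $\tau^2$) has the same effect.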
Bayesian Analysis
All the information needed to make
inferences in Bayesian analysis can be
found in the posterior distribution. Bayes'
theorem gives the posterior distribution,
up to a normalizing constant, as:
$$\text{Posterior} \propto \text{Prior} \times \text{Likelihood}$$
In many simple cases the posterior
distribution is easily obtained as a form of
some known distribution. In other, more
complex cases, it may be hard to evaluate
the posterior distribution directly. Recent
advances allow for numerical evaluation
using Markov chain Monte Carlo methods
(see Gelman, Carlin, Stern and Rubin,
1995). There are other texts that deal with
different aspects of the Bayesian paradigm,
ranging from philosophical issues about
probability to application methods (see
Berger, 1985 and Zellner, 1985).
Prior Distributions
In any analysis, once the statistical model is
chosen, the likelihood is also determined
using distributional assumptions. In
Bayesian analysis, there is the additional
step of choosing a prior distribution.
When there is no information available
about a parameter, usually a non-informative
prior distribution is chosen.
These priors contribute little or no weight to
the posterior. On the other hand, when
information is available about a parameter
of interest in the form of previous analysis
performed or data available, informative
priors can be formed. When the prior and
posterior distributions belong to the same
family of probability distributions, the prior is
called a conjugate prior. For instance,
when the prior is a beta distribution and the
likelihood is a binomial distribution, the
posterior is also a beta distribution. In this
case the beta distribution is a conjugate
prior for the binomial likelihood.
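The beta-binomial case can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names and the example prior Beta(2, 2) with 7 successes in 10 trials are assumptions chosen only to show the update rule.

```python
def beta_binomial_update(a, b, successes, trials):
    """Return the Beta(a', b') posterior for a Beta(a, b) prior
    after observing `successes` out of `trials` binomial trials."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # Conjugacy: successes add to a, failures add to b.
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical example values (assumed for illustration only).
prior = (2.0, 2.0)
posterior = beta_binomial_update(*prior, successes=7, trials=10)
posterior_mean = beta_mean(*posterior)
```

Because the posterior is again a beta distribution, only its two parameters need to be stored for later reuse.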
Data Warehouses and OLAP
Building a strategic data warehouse application involves three
phases: conceptualization, transition and implementation
(Welbrock, 1998). At the concept stage, all the business questions
that the data warehouse must answer are determined and the
metadata (information about data) is created. During transition, all
the data sources are identified, cleaned, summarized and stored in
the warehouse. In the implementation stage, the OLAP is built that
allows users to view different aspects of the data at any level and
obtain DSS reports.
Bayesian analysis in the data warehouse environment will closely
follow the three phases of construction. Since one of the main uses
for any statistical analysis is prediction, during the concept stage,
response and independent variables along with the priors and
likelihoods must be identified. During transition, all the data
sources corresponding to the response and independent variables
are identified. In addition, data sources for the prior as well as the
destination for the posterior analysis must be determined.
Implementation involves constructing the front-end screens and the
back-end programs that perform the Bayesian analysis.
In a typical Bayesian analysis, the posterior distribution is
evaluated for many priors (at least 2 - the non-informative and one
informative prior) for comparison and robustness. Since the
informative prior involves processing data, the data warehouse
model is one of the methods suitable not only for constructing the
prior, but also for calculating the posterior distribution and tracking
all the meta information relating to the data sets.
Bayesian Analysis of Linear Regression
In a normal linear regression model, we have the dependent
variable Y, as an n x 1 vector of observations, the independent
variables X, as an n x k matrix of values and $\beta$ as a k x 1
vector of parameters. The model is specified as:
$$Y = X\beta + e$$
where e is an n x 1 vector of error terms having the normal
distribution with mean 0 and variance $\sigma^2$. In a Bayesian
analysis the parameters $\beta$ and $\sigma^2$ are assumed to be
random. The likelihood function is given by:
$$L(\beta, \sigma^2 \mid X, Y) \propto \sigma^{-n} \exp\left[-0.5\,(Y - X\beta)'(Y - X\beta)/\sigma^2\right]$$
Jeffreys prior is a non-informative prior that can be imposed on
both $\beta$ and $\sigma^2$, so that the posterior distributions for
these parameters are:
$$P(\beta \mid \sigma^2, X, Y) = N[\hat{\beta}, (X'X)^{-1}\sigma^2]$$
$$P(\sigma^2 \mid X, Y) = IG[(n-k)/2, (n-k)s^2/2]$$
where
$$\hat{\beta} = (X'X)^{-1}X'Y, \qquad s^2 = Y'[I(n) - X(X'X)^{-1}X']Y/(n-k),$$
N is the normal distribution and IG is the inverted gamma
distribution. The posterior distribution for a p x 1 vector of
predicted values, Z, is given by:
$$P(Z \mid X_z, X, Y) = T[X_z\hat{\beta}, V, \nu]$$
where T is the Student's t distribution,
$V = \nu s^2\{I(p) + X_z(X'X)^{-1}X_z'\}/(\nu - 2)$ and $\nu = n - k$.
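The quantities that define the Jeffreys-prior posterior (the least-squares estimate, the residual variance s^2 and the degrees of freedom n - k) can be computed directly. The sketch below is a plain-Python illustration, not the SAS/IML program the paper refers to; the helper names and the four-point data set are assumptions made up for the example, and s^2 is computed as the residual sum of squares divided by n - k, which equals the quadratic form in the text.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, b):
    """Solve a @ x = b (a square, b a matrix) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def jeffreys_posterior(X, y):
    """Return (beta_hat, s2, nu) for the model Y = X beta + e."""
    n, k = len(X), len(X[0])
    if n <= k:
        raise ValueError("need more observations than parameters")
    Xt = transpose(X)
    beta_hat = [r[0] for r in solve(matmul(Xt, X),
                                    matmul(Xt, [[v] for v in y]))]
    resid = [yi - sum(b * x for b, x in zip(beta_hat, row))
             for row, yi in zip(X, y)]
    nu = n - k
    s2 = sum(e * e for e in resid) / nu
    return beta_hat, s2, nu


def predictive_mean(Xz, beta_hat):
    """Location X_z beta_hat of the Student's t predictive distribution."""
    return [sum(b * x for b, x in zip(beta_hat, row)) for row in Xz]


# Hypothetical data: intercept plus one regressor (assumed values).
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 8]
beta_hat, s2, nu = jeffreys_posterior(X, y)
z_mean = predictive_mean([[1, 4]], beta_hat)
```

Given these three values, the conditional normal for the regression parameters and the inverted gamma for the variance are fully specified, so they are the natural items to store as the "posterior" for later reuse.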
One of the informative priors of interest is the conjugate prior
given by:
$$P(\beta \mid \sigma^2) = N_k[\mu, \sigma^2\Sigma]$$
$$P(\sigma^2) = IG[r/2, W/2]$$
The posterior distributions of the parameters are:
$$P(\beta \mid \sigma^2, X, Y) = N[(\Sigma^{-1} + X'X)^{-1}(\Sigma^{-1}\mu + X'Y),\ \sigma^2(\Sigma^{-1} + X'X)^{-1}]$$
$$P(\sigma^2 \mid X, Y) = IG[(r+k)/2, (W + \nu A)/2]$$
where $\nu = n - k$ and
$$A = Y'[I(n) - X(\Sigma^{-1} + X'X)^{-1}X']Y.$$
The predictive pdf for future values can be derived similarly to the
Jeffreys prior case.
Bayesian Analysis in a Data Warehouse
During the concept phase, all the business questions that will use
the Bayesian analysis must be identified. For instance, in trying to
predict future total sales amount from independent variables such
as income, age, etc., it may be decided that a Bayesian analysis
should be used. All the statistical information such as predictive
value selection must also be identified. How the prior and posterior
distributions will be stored, retrieved and used must also be
detailed. All the meta information associated with the priors and
posteriors must be enumerated.
In the transition phase, all the data sources in the warehouse that
will contribute to the dependent and independent variables must be
identified. Here metadata and source data registration in the
warehouse become crucial steps. Separate tables are necessary
for storing prior and posterior information that allow for updating.
This feature is somewhat unique, because updating does not
usually take place in a data warehouse environment.
The implementation will consist of building screens that will
interactively collect information about the independent and
dependent variables and the prior information. The back-end
programs will contain the calculations listed above for the linear
regression model. Since all the calculations involve matrices,
SAS/IML will be used.
All input is to be chosen from selection lists.
Model Screen:
Enter Data source for independent variables:
Enter independent variables:
Enter Data source for dependent variable:
Enter Dependent variable:
Screen for Informative Prior Data Specification:
Enter Prior Data source for Regression Parameters: ____
Construct Prior distribution
Enter Unique Name: ____
Enter Number of parameters (k): ____
Enter Prior parameters for $\sigma^2$:
r: ____  W: ____
Enter Prior parameters for $\beta$:
$\mu$: ____  $\Sigma$: ____
The values entered in the prior parameter
screen will be stored in the data source
specification with a user id, time and date
stamp. The unique name should identify
the prior for the Bayesian analysis.
Bayesian Analysis Screen:
Enter Prior Data source: ____
Enter Unique Prior name: ____
Optional: Time stamp: ____  Date stamp: ____
Do posterior parameters need to be stored as a prior? _Yes _No
If Yes, Enter Unique Name: ____
The Bayesian analysis should automatically
include a Jeffreys or other non-informative
prior analysis. One of the nice features of
Bayesian analysis is that when more data are
included later for the same prior, the
analysis does not have to be repeated for
all the data. The posterior distribution
from the previous analysis can be used as
the prior distribution with the new data to
yield the same results as a single analysis
of all the data. The posterior can
be optionally stored in the prior data source
with a unique name, user id, time and date
stamps.
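The claim that a stored posterior can serve as the next prior can be checked numerically. The sketch below is a deliberately simplified illustration and not the paper's program: it assumes a single regression coefficient with no intercept, works conditionally on the error variance (so precisions are expressed in units of 1/variance), and uses made-up data and function names.

```python
import math


def update(prior_mean, prior_precision, xs, ys):
    """Conjugate normal update for y = beta * x + e, conditional on the
    error variance. Returns the posterior mean and precision of beta."""
    post_precision = prior_precision + sum(x * x for x in xs)
    post_mean = (prior_precision * prior_mean
                 + sum(x * y for x, y in zip(xs, ys))) / post_precision
    return post_mean, post_precision


# Hypothetical data arriving in two batches (assumed values).
xs1, ys1 = [1.0, 2.0], [2.1, 3.9]
xs2, ys2 = [3.0, 4.0], [6.2, 7.8]
prior = (0.0, 1.0)  # prior mean and prior precision

# One analysis of all the data at once.
batch = update(*prior, xs1 + xs2, ys1 + ys2)

# Two analyses: the first posterior becomes the second prior.
first = update(*prior, xs1, ys1)
sequential = update(*first, xs2, ys2)

same_mean = math.isclose(batch[0], sequential[0])
same_precision = math.isclose(batch[1], sequential[1])
```

Both routes give the same posterior mean and precision, which is why only the stored posterior parameters, and not the earlier raw data, are needed when new data arrive.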
The results from the Bayesian analysis can
be stored in data sources along with the
prior name, user id, time and date stamp for
future reference.
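One way to picture the prior and posterior store is a single table keyed by the unique name, with the user id, a time and date stamp, and the distribution parameters. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the warehouse; the table name, column names and parameter layout are hypothetical and are not specified in the paper.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) a hypothetical prior/posterior store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS distribution_store (
               unique_name TEXT PRIMARY KEY,  -- identifies the prior
               user_id     TEXT NOT NULL,
               stamped_at  TEXT NOT NULL,     -- time and date stamp
               kind        TEXT NOT NULL,     -- 'prior' or 'posterior'
               params      TEXT NOT NULL      -- parameters as JSON
           )"""
    )
    return conn


def save_distribution(conn, unique_name, user_id, kind, params):
    """Store one distribution; a duplicate name raises IntegrityError."""
    stamp = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO distribution_store VALUES (?, ?, ?, ?, ?)",
            (unique_name, user_id, stamp, kind, json.dumps(params)),
        )


def load_distribution(conn, unique_name):
    """Return the stored parameters for a given unique name."""
    row = conn.execute(
        "SELECT params FROM distribution_store WHERE unique_name = ?",
        (unique_name,),
    ).fetchone()
    if row is None:
        raise KeyError(unique_name)
    return json.loads(row[0])


# Hypothetical usage: store a conjugate prior and read it back.
conn = open_store()
prior_params = {"mu": [0.0, 1.0], "Sigma": [[1.0, 0.0], [0.0, 1.0]],
                "r": 2, "W": 1.0}
save_distribution(conn, "sales_prior_v1", "analyst01", "prior", prior_params)
loaded = load_distribution(conn, "sales_prior_v1")
```

Making the unique name the primary key enforces the requirement that each stored prior or posterior be identifiable by name; a posterior saved with kind 'posterior' can later be loaded and used as the prior for a new analysis.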
Conclusions
Bayesian analysis enhances the utility and
sensitivity of available statistical data
analysis methods. Using concepts
developed for the data warehouse,
Bayesian analysis can be implemented
much more easily than was possible in the
past. SAS Software allows developers to
implement sophisticated analysis in a data
warehouse through the seamless
integration of its different modules. In this
case, the data management system
interacts with the screens built in SAS/AF
and the analysis programs written in
SAS/IML. Data warehouses need not rely
just on DSS and data mining tools to yield
gems or nuggets of information. Bayesian
Analysis offers a new universe of statistical
models allowing users to exploit data
warehouses by combining prior information.
References
Berger, J.O. (1985) Statistical Decision Theory and
Bayesian Analysis, Springer-Verlag, New York.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.
(1995) Bayesian Data Analysis, Chapman and Hall,
New York.
Welbrock, P.R. (1998) Strategic Data Warehousing
Principles Using SAS Software, Cary, NC: SAS
Institute Inc.
Zellner, A. (1985) An Introduction to Bayesian
Inference in Econometrics, Krieger Publishing,
Malabar, Florida.
SAS is a registered trademark of the SAS Institute
Inc., Cary, NC, USA.
Ramanan Gopalan, Ph.D
Data Infoworks, Inc.
1015 Helen Avenue, Suite B
Sunnyvale, CA 94086