Data Warehouses and Bayesian Analysis - A Match Made by SAS

Ramanan Gopalan, Data Infoworks, Inc., Sunnyvale, CA

ABSTRACT

Data warehouses have the advantage of storing and retrieving large amounts of data in a coherent and efficient manner. Bayesian analysis uses large amounts of data, either from outside sources or from previous analyses. SAS is probably the only software environment that seamlessly connects data warehouses and Bayesian analysis through its data management facilities. In this paper, I demonstrate how data warehouses can be used as a tool for tracking prior data and posterior estimates in a Bayesian linear regression model. SAS/AF will be the front-end tool, with PROC IML performing the Bayesian regression.

Key words: prior, posterior, linear regression

INTRODUCTION

Data warehouses make information available for fulfilling decision support needs throughout an enterprise. Consequently, building the data warehouse is a business process that involves technology only to the extent that it facilitates achieving the enterprise requirement. The true benefits of a data warehouse are realized only when it is exploited using statistical tools. One of the most commonly used tools is the decision support system (DSS). This usually comprises summary tables and slices of the data that aid immediate decision making on a daily or frequent basis. For instance, a car dealer may want to know how many cars of a particular model he sold in the last month; the DSS will contain a report of the number of cars sold broken down by model. In many data warehouse applications, users may be satisfied with just receiving such reports from a DSS, and usually no further data analysis is performed. Recently, there has been a lot of interest in data mining - analysis techniques used to uncover patterns in the data that were not anticipated or were hidden from view.
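The DSS report mentioned above (cars sold broken down by model) is, at bottom, a simple grouped count. A minimal sketch in Python with made-up sales records (the model names and figures are hypothetical; the paper itself would of course produce such a report with SAS summary procedures):

```python
from collections import Counter

# Hypothetical sales records for last month; each entry is the model sold.
sales = ["Sedan A", "SUV B", "Sedan A", "Coupe C", "SUV B", "Sedan A"]

# The DSS summary table: number of cars sold, broken down by model.
report = Counter(sales)
for model, units in sorted(report.items()):
    print(f"{model}: {units} sold")
```

This is exactly the kind of fixed, pre-aggregated view a DSS serves daily; the point of the paper is that the warehouse can support richer analyses than such counts.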
Neural networks and CART are two of the techniques widely used in data mining software such as the SAS Enterprise Miner. Time series analysis and other trending techniques are also used to reveal any dependence on time factors. Bayesian analysis of data allows users to incorporate their subjective beliefs into the statistical analysis. These beliefs can be based either on previous data (data-based priors) or on opinions about the value of the parameter being evaluated (judgemental priors). In either case, the priors have to be evaluated and formulated before any further analysis based on them can be done. In this conceptual paper, I explain Bayesian analysis, prior and posterior distributions, and how the SAS System can be used to evaluate priors and present posterior analysis in a data warehousing environment.

Statistical Inference

The main purpose of building a statistical model for given data is to infer findings to a more general population from which the data is a sample. All the information from the given data can be summarized by the likelihood function. There are two schools of thought on how an inference can be made. The classical or frequentist school assumes that the model parameters are fixed but unknown, and that the given data is only one sample among many that can be obtained from the model. Conclusions are made using p-values and significance levels. The p-value tells us how likely it is that similar data can be generated by the underlying model under identical conditions: the smaller the p-value, the more unlikely it is that the underlying model generated the data. Bayesian inference, on the other hand, assumes that the model parameters are unknown and random, i.e., that probabilities can be assigned to values of the unknown parameters. Conclusions are based on the posterior distributions of the parameters, which contain all the information about the parameters after observing the data. This approach is appealing to many practitioners because it does not assume repeated sampling and allows users to incorporate their own beliefs into the inference.

Prior information is sometimes used in frequentist inference as well. One is supposed to accept or reject a null hypothesis not only on the basis of the p-values obtained but also in light of any knowledge one may have about the inferential problem at hand. This is why, in many instances, statistical significance does not lead to practical significance and vice versa. The Bayesian approach allows a user to formally incorporate prior information in the form of a distribution. After combining it with the data, we obtain the posterior distribution, which is really a weighted average between the prior and the model likelihood. Probabilities derived from the posterior distribution (posterior probabilities) are used to make judgements about values of a parameter given the data.

Bayesian Analysis

All the information needed to make inferences in Bayesian analysis can be found in the posterior distribution. Bayes' theorem elegantly gives the posterior distribution as:

    Posterior ∝ Prior × Likelihood

In many simple cases the posterior distribution is easily obtained as a form of some known distribution. In other, more complex cases, it may be hard to evaluate the posterior distribution directly. Recent advances allow for numerical evaluation using Markov chain Monte Carlo methods (see Gelman, Carlin, Stern and Rubin, 1995). Other texts deal with different aspects of the Bayesian paradigm, ranging from philosophical issues about probability to application methods (see Berger, 1985 and Zellner, 1985).

Prior Distributions

In any analysis, once the statistical model is chosen, the likelihood is also determined by the distributional assumptions. In Bayesian analysis, there is the additional step of choosing a prior distribution. When there is no information available about a parameter, usually a noninformative prior distribution is chosen.
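The Prior × Likelihood update can be made concrete with the standard beta-binomial model. The counts and the Beta(2, 2) prior below are hypothetical, chosen purely for illustration; the sketch shows the conjugate update and the weighted-average behaviour of the posterior described above:

```python
# Beta-binomial update: Posterior ∝ Prior × Likelihood.
# Hypothetical inputs: a Beta(2, 2) prior and 38 successes in 100 trials.
a_prior, b_prior = 2.0, 2.0
successes, trials = 38, 100

# Conjugacy: Beta(a, b) prior + binomial data -> Beta(a + s, b + n - s) posterior.
a_post = a_prior + successes
b_post = b_prior + (trials - successes)

prior_mean = a_prior / (a_prior + b_prior)        # 0.5
sample_prop = successes / trials                  # 0.38
post_mean = a_post / (a_post + b_post)            # falls between the two
print(a_post, b_post, post_mean)
```

The posterior mean (about 0.385) lies between the prior mean (0.5) and the observed proportion (0.38), a weighted average dominated by the data because 100 trials outweigh a Beta(2, 2) prior; a flatter prior would pull it even closer to the sample proportion.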
Noninformative priors do not contribute any weight to the posterior. On the other hand, when information is available about a parameter of interest, in the form of a previous analysis or of data on hand, informative priors can be formed. When the prior and posterior distributions belong to the same family of probability distributions, the prior is called a conjugate prior. For instance, when the prior is a beta distribution and the likelihood is binomial, the posterior is also a beta distribution; the beta distribution is thus a conjugate prior for the binomial likelihood.

Data Warehouses and OLAP

Building a strategic data warehouse involves three phases: conceptualization, transition and implementation (Welbrock, 1998). At the concept stage, all the business questions that the data warehouse must answer are determined and the metadata (information about data) is created. During transition, all the data sources are identified, cleaned, summarized and stored in the warehouse. In the implementation stage, the OLAP layer is built, allowing users to view different aspects of the data at any level and obtain DSS reports.

Bayesian analysis in the data warehouse environment closely follows the same three phases of construction. Since one of the main uses of any statistical analysis is prediction, during the concept stage the response and independent variables, along with the priors and likelihoods, must be identified. During transition, all the data sources corresponding to the response and independent variables are identified; data sources for the prior, as well as the destination for the posterior analysis, must be determined as well. Implementation involves constructing the front-end screens and the back-end programs that perform the Bayesian analysis.

In a typical Bayesian analysis, the posterior distribution is evaluated for several priors (at least two: a noninformative prior and one informative prior) for comparison and robustness. Since the informative prior involves processing data, the data warehouse model is well suited not only to constructing the prior, but also to calculating the posterior distribution and tracking all the meta information relating to the data sets.

Bayesian Analysis of Linear Regression

In a normal linear regression model, we have the dependent variable Y, an n x 1 vector of observations; the independent variables X, an n x k matrix of values; and β, a k x 1 vector of parameters. The model is specified as:

    Y = Xβ + e

where e is an n x 1 vector of error terms having the normal distribution with mean 0 and variance σ². In a Bayesian analysis the parameters β and σ² are assumed to be random. The likelihood function is given by:

    L(β, σ² | X, Y) ∝ Exp[-0.5 (Y - Xβ)'(Y - Xβ)/σ²] / σⁿ

The Jeffreys prior is a noninformative prior that can be imposed on both β and σ², so that the posterior distributions for these parameters are:

    P(β | σ², X, Y) = N[b, (X'X)⁻¹σ²]
    P(σ² | X, Y) = IG[(n-k)/2, (n-k)s²/2]

where b = (X'X)⁻¹X'Y, s² = Y'[I(n) - X(X'X)⁻¹X']Y/(n-k), N is the normal distribution and IG is the inverted gamma distribution. The posterior distribution for a p x 1 vector of predicted values, Z, is given by:

    P(Z | Xz, X, Y) = T[Xz b, V, v]

where T is the Student's t distribution, V = v s²{I(p) + Xz(X'X)⁻¹Xz'}/(v-2) and v = n - k.

One of the informative priors of interest is the conjugate prior given by:

    P(β | σ²) = N_k[μ, σ²Σ]
    P(σ²) = IG[r/2, W/2]

The posterior distributions of the parameters are:

    P(β | σ², X, Y) = N[(Σ⁻¹ + X'X)⁻¹(Σ⁻¹μ + X'Y), σ²(Σ⁻¹ + X'X)⁻¹]
    P(σ² | X, Y) = IG[(r+k)/2, (W + vA)/2]

where A = Y'[I(n) - X(Σ⁻¹ + X'X)⁻¹X']Y. The predictive pdf for future values can be derived in a manner similar to the Jeffreys prior case.

Bayesian Analysis in a Data Warehouse

During the concept phase, all the business questions that will use the Bayesian analysis must be identified. For instance, in trying to predict future total sales from independent variables such as income, age, etc., it may be decided that a Bayesian analysis should be used. All the statistical information, such as predictive value selection, must also be identified. How the prior and posterior distributions will be stored, retrieved and used must be detailed, and all the meta information associated with the priors and posteriors must be enumerated.

In the transition phase, all the data sources in the warehouse that will contribute to the dependent and independent variables must be identified. Here metadata and source data registration in the warehouse become crucial steps. Separate tables are necessary for storing prior and posterior information that allow for updating. This feature is somewhat unique, because updating does not usually take place in a data warehouse environment.

The implementation will consist of building screens that interactively collect information about the independent variables, the dependent variable and the prior information. The back-end programs will contain the calculations listed above for the linear regression model. Since all the calculations involve matrices, SAS/IML will be used. All input is to be chosen from selection lists.

Model Screen:
    Enter Data source for independent variables: ______
    Enter independent variables: ______
    Enter Data source for dependent variable: ______
    Enter Dependent variable: ______

Informative Prior Data Specification Screen:
    Enter Prior Data source for Regression Parameters: ______
    Construct Prior distribution:
        Enter Unique Name: ______
        Enter Number of parameters (k): ______

Bayesian Analysis Screen:
    Enter Prior Data source: ______
    Enter Unique Prior name: ______
    Optional: Time stamp: ______  Date stamp: ______
    Do posterior parameters need to be stored as a prior?
    __Yes   __No
    If Yes, Enter Unique Name: ______

    Enter Prior parameters for σ²:
        r: ______   W: ______
    Enter Prior parameters for β:
        μ: ______   Σ: ______

The values entered in the prior parameter screen will be stored in the data source specification with a user id and a time and date stamp. The unique name should identify the prior for the Bayesian analysis. The Bayesian analysis should automatically include a Jeffreys or other noninformative prior analysis alongside the informative one.

One of the nice features of Bayesian analysis is that when more data are included later for the same prior, the analysis does not have to be repeated on all the data. The posterior distribution from the previous analysis can be used as the prior distribution with the current data to yield the same results. The posterior can optionally be stored in the prior data source with a unique name, user id, and time and date stamps. The results from the Bayesian analysis can be stored in data sources along with the prior name, user id, and time and date stamp for future reference.

Conclusions

Bayesian analysis enhances the utility and sensitivity of available statistical data analysis methods. Using concepts developed for the data warehouse, Bayesian analysis can be implemented much more easily than was possible in the past. SAS software allows developers to implement sophisticated analysis in a data warehouse through the seamless integration of its different modules: in this case, the data management system interacts with screens built in SAS/AF and analysis programs written in SAS/IML. Data warehouses need not rely just on DSS and data mining tools to yield gems or nuggets of information. Bayesian analysis offers a new universe of statistical models, allowing users to exploit data warehouses by combining them with prior information.

References

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B.
(1995). Bayesian Data Analysis. Chapman and Hall, New York.

Welbrock, P.R. (1998). Strategic Data Warehousing Principles Using SAS Software. Cary, NC: SAS Institute Inc.

Zellner, A. (1985). An Introduction to Bayesian Analysis in Econometrics. Krieger Publishing, Malabar, Florida.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.

Ramanan Gopalan, Ph.D.
Data Infoworks, Inc.
1015 Helen Avenue, Suite B
Sunnyvale, CA 94086
Email: [email protected]