NESUG 15 Graphics & Information Visualization

VISUAL OPTIMIZATION: USING SAS AND JAVA TO EXPLORE HIGH DENSITY DATA IN EQUITY PORTFOLIOS
Haftan Eckholdt and Ezra Benun, DayTrends, LLC, Brooklyn, NY

ABSTRACT
Computational constraints prevent analysts from exploring and understanding alternative approaches to equity trading. We will discuss methods for using SAS visual tools to assess various equity investment strategies that were back tested in simulation. Visualization tools in SAS and JAVA were used to identify desirable surfaces on multidimensional objects, which were then scrutinized more deeply in lower dimensional objects to understand their behavior. This method is an approach to the diagnosis and prognosis of portfolio investment and management strategies that could then be used to reduce costs, minimize volatility, etc. For this discussion, various strategies for investing in members of the S&P500 Index were back tested and reviewed for desirable returns. Characterizing thousands of portfolio strategies in sets of multidimensional objects can give analysts a broader perspective on their investment strategies.

INTRODUCTION
Now that powerful analytic tools like computer clusters are affordable and simple to maintain, previously unexplored ideas can be studied. In finance, one of several disciplines traditionally associated with very large data structures, these unexplored ideas might include different investment vehicles, different investment strategies, different amounts of history employed, or some combination of all of the above. Even with such powerful analytic tools, approaches to identifying successful strategies are not yet established. Some possible tools might come from other data intensive disciplines such as genetics, neuroscience, or number theory.

Organizing Principles
Several authors have published texts on the topic of the visual display of information. The dialogue on this topic began for quantitative scientists with the publication of Exploratory Data Analysis by John Tukey (1970). This book may have been one of the first compilations of many examples with instructions for "looking at data to see what it seems to say".
In his preface, Tukey suggests several principles of Exploratory Data Analysis that address "a basic problem about any body of data": to make it more easily and effectively handleable by minds:

• anything that makes a simpler description possible makes the description more easily handleable.
• anything that looks below the previously described surface makes the description more effective.
• to be able to say that we looked one layer deeper, and found nothing, is a definite step forward, though not as far as to be able to say that we looked deeper and found thus-and-such.
• to be able to say that "if we change our point of view in the following way... things are simpler" is always a gain...

Tukey's well established principles can be used to construct algorithms that help analysts represent, describe, and question their strategies. In this way, each spatiotemporal event can be parsed into component dimensions that are common (space and time) and unique (market condition, within- or between-equity factors). Other dimensions in high dimensional research described by Mazziotta (1996), like surface rendering, can also be codified. Characterizing the essence of a financial strategy requires mapping each relevant dimension into a frame or object, with animation used to represent another dimension. Keep in mind that other critical dimensions can be represented in the visualization by using color, shape, size, and maybe even sound.
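To make this dimension mapping concrete, the sketch below assigns three strategy dimensions to the axes of a rotatable PROC G3D scatter plot and a fourth to color. It is only a minimal illustration of the idea: the data set STRATS and its variables (RETURN, VOLATILITY, HOLDPER, FAMILY) are hypothetical names, not structures from this study.

/* One observation per simulated strategy is assumed. */
data strats_plot;
   set strats;
   length cvar $8;
   /* map a fourth dimension (decision-criteria family) onto color */
   if family = 1 then cvar = 'blue';
   else if family = 2 then cvar = 'green';
   else cvar = 'red';
run;

proc g3d data=strats_plot;
   scatter volatility*return=holdper /
           color=cvar          /* character variable of SAS color names */
           shape='balloon'
           rotate=30 tilt=70;
run;
quit;

Here space (the three axes) and one unique dimension (color) are shown; rerunning the plot across time slices, or across values of another decision criterion, supplies the animated dimension discussed below.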
Research on Visual Analysis
Tukey (1977) has often been cited as one of the first people to quantify and qualify the visual analysis of data with his germinal edition of Exploratory Data Analysis. The principles and procedures described in that volume have become a requirement for the initial descriptive analyses performed on almost any data set. His beginnings were further refined with books that provide deeper, more detailed examples of general graphical events (Chambers, Cleveland, Kleiner, and Tukey, 1983; Cleveland, 1985) as well as neurophysiologic graphical events (Toga, 1990; Toga and Mazziotta, 1996). Tufte has produced a series of books on graphical events that may be considered a beautiful finale to this dialogue (Tufte, 1983; Tufte, 1990; Tufte, 1997). Each of Tufte's books provides the reader with examples of how different kinds of information can be displayed and, in some cases, a set of clear criteria for assessing the utility of a good display. There are two levels of dialogue critical to the visual analysis of large spatiotemporal events that are often missing from these books: (1) discussion of the impact of a graphic on the researcher, and (2) discussion of animation.

Several researchers have tried to measure the impact of a graphic on a viewer empirically. In these experiments the viewers, usually quantitative or behavioral science professionals, are shown plots of data that represent various levels of effect size. From these experiments two broad conclusions can be drawn. The first conclusion is that visual analysis of data tends to be more conservative than statistical analysis: viewers do not see effects that are statistically significant (DeProspero and Cohen, 1979; Gibson and Ottenbacher, 1988; Park, Marascuilo, and Gaylord-Ross, 1990), thus leading to higher probabilities of Type II error. The second conclusion is that auto-correlation further obscures the recognition of effects (serial dependency: Jones, Weinrott, and Vaught, 1978; Wampold and Furlong, 1981), with viewers seeing fewer effects within and between temporal events when the data are highly correlated, like those found in finance and economics. Unfortunately, these experiments were all conducted with two dimensional, still graphical representations of data that show changes over time. There are no similar experiments that assess the factors affecting visual analysis of data animations. Only a few researchers have addressed animations and their use in data analysis (Palovcik, et al., 1992; Toga and Payne, 1991).

As the complexity of the models increases, we may need to add another dimension to the display. Animation can provide that added dimension. Movements in space allow us to appreciate complex spatial relationships because many more 3-D cues are presented in a given amount of view time. However, present-day algorithms and hardware make this process difficult and expensive (Toga, 1990; page 269).

Evaluation of the Methods
In order to appreciate the more specific impact of the proposed data visualizations on the analysis of investment, several issues will be discussed. Initially, the barriers to investment research that were enumerated earlier will be acknowledged along with solutions. This is followed by a discussion of the errors in research (strategies, groups, managers, entire firms). This section may in fact provide a platform for assembling criteria for assessing any approach to the data.

Barriers overcome
By using the principles outlined by Tukey (1970) to organize data visualization along the critical dimensions identified by Mazziotta (1996), researchers in economics will gain a clearer understanding of their data, how to approach their questions about the data, and how to convey their findings. By developing a set of downloadable computer programs, researchers will be able to manage their directory structures and file naming systems more efficiently. By developing and distributing computer programs to read data, researchers will be able to analyze their data more quickly. Finally, by describing the criteria for assessing visualizations, researchers will be able to characterize and discuss the use and impact of visual analysis on their research.
Error Reduction
Researchers need to be aware that they are always at risk of committing errors in their research. The proposed visualizations address errors that are of special concern to researchers in finance. The Type I error rate, known as alpha, is the "p value" that many researchers use to determine the likelihood that their finding may have occurred by chance. These errors are the most embarrassing, and are therefore kept below an industry standard of 5%. The Type II error rate, known as beta, represents the probability that a researcher will fail to recognize an effect that truly exists in nature (Cohen, 1988). Most people are familiar with Type II error as it is represented by (1 - beta), better known as "power", the probability of correctly identifying an effect. Most researchers try to keep power at about 0.80, which leaves them with a 20% chance of overlooking or missing real effects.

It is unlikely that Type III and Type IV errors are calculable, but they do represent mistakes that can be identified at some point in the analytic process. Type III error represents the probability of applying an inappropriate statistical procedure to a given set of data (Kaiser, 1960; Kimball, 1957; Mosteller, 1948), while Type IV error represents the probability of misinterpreting a statistically significant result (Keselman and Murray, 1974).

Traditionally, Type I error is thought to evolve in the analysis of large spatiotemporal events during the statistical analysis of the data. Testing an investment strategy often involves statistical inferences conducted between various combinations of conditions until the "effect" evolves empirically. This process alone could inflate Type I error rates as the same data are exposed to multiple statistical tests. Keep in mind that alpha percent of statistical tests result in the false rejection of the null hypothesis. There is a tradition in some specialties of using statistical procedures that involve intensive repeated testing of data along the lines of a "moving t-test", where the difference between two signals may be tested through several hundred frames. Thus, one might expect quite a few tests to be significant in any one experiment purely by chance.
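To put rough numbers on that concern, the short step below computes the expected count of false rejections and the chance of at least one, under the simplifying assumption that the repeated tests are independent; the 300-frame figure is only illustrative, and serial dependency in real market data would change the exact values.

data type1_inflation;
   alpha  = 0.05;                             /* nominal per-test Type I rate          */
   ntests = 300;                              /* e.g., a moving t-test over 300 frames */
   expected_false = alpha * ntests;           /* about 15 false rejections             */
   p_at_least_one = 1 - (1 - alpha)**ntests;  /* essentially 1.0                       */
   put expected_false= p_at_least_one=;
run;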
This kind of Type I error can best be addressed by improved statistical techniques, and it sheds light on a way that Type I error could be evolving in visual analysis. In some respects, the act of viewing data graphically is equivalent to the act of conducting a statistical test. This proposition is most salient when reviewing the empirical literature on visual analysis. Type I errors of this kind are, in fact, very rare in visual analysis, since experts are not confident that two signals diverge until they are separated by 2-3 standard errors. In most statistical analyses, only 1 standard error is necessary for two signals to be statistically distinct. Type I errors could also occur by viewing the data from one source intensively and then conducting inferential tests on that very same source. This kind of Type I error could be overcome by using one source to visually analyze a priori hypotheses and generate post hoc hypotheses, while using data from other, unseen sources to statistically analyze the a priori and post hoc hypotheses.

This proposal is especially concerned with Type II error as it evolves in the field of finance. Such error can find its way into the testing of a priori hypotheses in economic research when the effect is obscured by noise. Traditionally, Type II errors are overcome by maintaining adequate sample sizes and by identifying (and effectively removing) sources of noise in a model. In finance, this kind of Type II error is usually overcome by obtaining more data from an event or process of interest (hundreds of times more in the case of intra-day data), and through the application of various signal filtration devices and algorithms. Using a more liberal interpretation, a Type II error could be literally "failing to see the effect". It is this interpretation of Type II error that may be most common in financial research in the absence of widely available supercomputing systems. From the a priori approach, an analyst may fail to see the effects designed into the strategy itself, thus failing to reject the a priori null hypothesis in the face of true experimental effects. From the post hoc perspective, one might also ponder the hypotheses that might never be entertained by an investment group simply because no human has the brain capacity to store, recall, and compare the spatiotemporal events represented by their data.

New hypotheses are so often implied or "called up" by the inspection of existing data. In this light, one could even consider that some of the most important questions in the field of neuroscience might never be asked for lack of the adequate pre-hypothesis stimulation necessary to invoke the question in researchers. This oversight, yet to be named formally, could be thought of as Type V error in this proposal: the probability of failing to formulate a hypothesis. It is important to note that computerized visualization (graphic) techniques have been used not only to visualize already known phenomena but also, more interestingly, as participants in the process of solving problems not yet completely answered (Toga and Mazziotta, 1996 [emphasis added]).

Data visualization techniques may be a very powerful way to address Type II errors in the first case, where visual analysis overcomes the obscuring effects of auto-correlation. Ultimately this question can only be addressed empirically by back testing and forward testing. One could argue that animating time "breaks up" the appearance of auto-correlation. This may be especially true for many analysts who treat time as a critical dimension to be manipulated just as any other critical dimension. In this latter case, the representation of time could be manipulated to emphasize changes between adjacent frames of any size. Time could also be represented in such a way as to essentially illustrate a "lag" of any size, so that multiple data frames are presented to the viewer simultaneously.

Data visualization may also be a powerful way to address Type II errors in the second case, where visual analysis provides analysts with the pre-hypothesis stimulation necessary to create new hypotheses (the so-called Type V error). This may, in fact, be impossible to demonstrate empirically. The reader is encouraged to ask "what new questions might be entertained by analysts as a result of being exposed to all of the data from their group in such a principled, rapid, and accessible format?"

Type III errors occur most often where the nature of the data and the underlying processes responsible for the data do not match the assumptions of the statistical model. Simply knowing the market, or knowing the strategy model, is not enough to prevent Type III errors. Having knowledge of both is essential, and what better way to quickly and clearly inform an analyst or collaborating manager of the complexity of the data and its origins than showing different animations of the strategy under study. In all likelihood, the visual analysis of data will reduce confusion about the nature of the overall approach and the specific strategy, while at the same time demonstrating the complexity of the data along the critical axes of market, time, and strategy.

Finally, Type IV errors evolve after the analysis, when analysts misinterpret significant statistical findings. Once again, the visual analysis of data reduces confusion about the nature of the strategies being simulated in such a way as to reduce the chances of misinterpreting a significant finding. Most of the literature on Type IV errors focuses on the misinterpretation of significant interaction terms. The visual analysis of economic data, in animation or in high dimensional still frames, can only help analysts to understand the meaning of the interaction. In fact, certain interactions can be designated as critical dimensions in and of themselves. In this way, some animations can be constructed to emphasize the existence (or non-existence) of the interactions most relevant to the process under study. In this application, visual analysis represents a leap ahead of the traditional method of interpreting interactions in a large, multidimensional data space: looking at F tests, waving hands, drawing diagrams, and comparing statistics across cells and factors.
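One sketch of how such a sequence of animation frames might be assembled in SAS/GRAPH is shown below: a hypothetical macro writes one GIF-backed HTML page per value of a decision criterion (here a maximum hold period), and the resulting files can be played back in order with a frame viewer such as the JAVA tool described later. The data set BACKTEST and the variables MAXHOLD, RET, and VOL are illustrative names rather than objects from this study.

/* Hypothetical sketch: one GIF frame per threshold of a decision criterion. */
goptions device=gif transparency noborder;

%macro frames(lo=5, hi=30, by=5);
   %do hold = &lo %to &hi %by &by;
      ods html path="c:\SP500S\DOCS" body="frame&hold..html" nogtitle;
      title2 "Maximum hold period: &hold days";
      proc gplot data=backtest(where=(maxhold <= &hold));
         plot ret*vol;
      run;
      quit;
      ods html close;
   %end;
%mend frames;

%frames()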
DEMONSTRATION
For this discussion, various strategies for investing in members of the S&P500 were back tested and reviewed for desirable returns. Those strategies reaching criteria were further studied and optimized with secondary strategies to reduce cost, reduce volatility, and finally, to reduce risk. In each step, SAS visual tools were used to identify strategies of interest. While the immediate goal of this discussion was to perfect the data acquisition and data animation systems in SAS, the reader should note that JAVA is also used out of convenience.

Data Sources
Pilot data for this paper were derived from the S&P500 Index. Membership history was taken from published records of SP500 membership, and equity histories were bought from suppliers of close equity data (TC2000, etc.). Overall, there were 31 stock-years represented in the chart, and we are missing a cumulative total of 6 stock-years, primarily from the period of 1970 to 1980. Only 3.5% of all stocks identified have absolutely no price history. Data from the close prices were used to simulate trades from a number of strategies. The portfolio averages 150-200 equities when full, with an average hold of less than 1 month. Costs were $0.01 per share each way and $10 per trade each way. Strategies were traded with the goal of keeping data from 1981 to 2000 "out of sample". There was no cost process estimated for transitioning in and out of strategies.

Figure 1. Percent of equities identified as members of the Index in blue, and the percent of equity histories acquired.

Visualizing Time
In order to address the basic needs of the investment analysts, a rapid data management, analysis, and animation system was drafted to back test daily returns from 1970 to present and from 1990 to present, depending upon the focus of the modeling. Initial strategies were based exclusively on S&P500 Index members (DayTrends, 2000), contrasted with Index performance and weighting.

Figure 1. Ten year back test of log $1 return of 2 blended strategies (blue) and the SP500 Index (red) against time.

The code necessary to generate Figure 1, once the portfolio investment has been fully simulated, uses ODS output options designed to produce GIF images most simply embedded in HTML objects:

filename odsout "c:\SP500S\DOCS\";
ods html path=odsout body="&SELECT..HTML" nogtitle;

goptions reset=global gunit=pct colors=(red blue green)
         ftext=swiss ftitle=swissb htitle=6 htext=3
         device=gif transparency noborder ctext=white cback=black;

symbol1 i=join v=point color=blue;
symbol2 i=join v=point color=red;
axis1 width=2.0 label=("Time");
axis2 width=2.0 label=("Log $1 Invested") order=(-1 to 4 by 1);
legend1 label=none mode=protect
        value=(color=blue "DayTrends" color=red "SP500");

proc gplot;   /* uses the most recently created simulation data set */
   title2 "&SELECT from 1990-2000";
   plot (DTLLC SP500)*TIME / overlay haxis=axis1 vaxis=axis2
                             legend=legend1 vref=0 1 2 3;
run;
quit;

options nobyline;
ods html close;

File Seeking
Trading portfolios were simulated on a cluster of workstations (COW) managed by SAS, and performance statistics were gathered from across the cluster using an approach to network file seeking. In essence, this approach to file management amounts to a series of X commands designed to output sequential DIR statements to text files that are subsequently read by SAS as data files until there are no deeper subdirectories. Each DIR output of files and directories is appended to a database of the files, directories, drives, and machines on the network cluster, which can then be combined with a series of massive data/set statements.
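A minimal sketch of this file-seeking idea, assuming Windows workstations and a hypothetical starting directory, is shown below; the single DIR /S call collapses the iterative descent through subdirectories described above into one listing.

options noxwait noxsync;

/* 1. Ask the operating system for a recursive directory listing. */
x 'dir "c:\SP500S" /b /s > "c:\SP500S\dirlist.txt"';

/* 2. Read the listing back into SAS as a data set of file paths. */
data filelist;
   infile 'c:\SP500S\dirlist.txt' truncover;
   length machine $16 path $200;
   machine = 'NODE01';   /* tag for the workstation that produced the listing */
   input path $200.;
run;

/* 3. Combine the listings gathered from each machine on the cluster
      (assumes MASTER already exists; PROC APPEND would also work).  */
data master;
   set master filelist;
run;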
Visualizing Space
Once enough strategies have been simulated, they can be summarized in terms of various performance statistics and further visualized, most easily using Interactive Data Analysis.

Figure 2. 3D object of Volatility x Return x Hold Period (X,Y,Z) for 81,000 portfolios back tested for 10 years from 1990 to 2000.

While the index performed near the minima of the X,Y axes (0.14 annual volatility and 0.15 annual return), the 3D object reveals many strategies that deliver considerably higher returns per unit of risk. Spinning this object reveals several surfaces that can be thought of as representing performance limits in these non-optimized strategies. At this point in the process, identifying a family of strategies becomes necessary for performing the next steps of model optimization. This can be done by viewing the selected returns in a series of animation frames in which more specific decision criteria are systematically manipulated. A JAVA based GIF frame viewer will be demonstrated.

To further study the relationships among different strategies, we performed virtual surgery by constructing virtual objects, usually spheres, in the (x,y,z) coordinate system, placing the center of each object so that its outer boundary fell along the relevant surfaces. The outer layer is then "peeled back" by adding (x,y,z) constants to the data points selected by the spherical criterion, for example:

SPHERE = (X)**2 + (Y + 2)**2 + (Z - 3)**2;
IF SPHERE < 15 THEN YC = -Y + 6;

The final product can be viewed in Interactive Data Analysis, where a new variable is constructed from the X,Y,Z distance to the center of the sphere, or an animation can be output with PROC G3D. This new animation revealed previously unseen surfaces that were never part of the original study design or strategy simulation.

Figure 3. Example of surface peeling using spherical criteria.

Influence
Once surfaces and sub-surfaces are obtained, the objects of interest can be reconstructed in 2 dimensions with color, shape, or size representing further diagnostically relevant strategies.

Figure 4. Return by Volatility among 81,000 backtests of 1990 to 2000, where each color represents similar decision criteria (SP500 Index return and volatility are represented as reference lines at 0.14 annual volatility and 0.15 annual return).

CONCLUSION
Characterizing thousands of portfolio strategies in sets of multidimensional objects can give analysts and money managers a broader perspective on their investment strategies. In many cases, there may be investment products underlying currently employed strategies that would be of interest to new clients.

CONTACT INFORMATION
Haftan Eckholdt
DayTrends, LLC
10 Jay Street, Suite 504
Brooklyn, New York 11201
Work Phone: (917) 548-3178
Fax: (718) 522-3170
Email: [email protected]