NESUG 15
Graphics & Information Visualization
VISUAL OPTIMIZATION: USING SAS AND JAVA
TO EXPLORE HIGH DENSITY DATA IN EQUITY PORTFOLIOS
Haftan Eckholdt and Ezra Benun, DayTrends, LLC, Brooklyn, NY
ABSTRACT
Computational constraints prevent analysts from exploring and
understanding alternative approaches to equity trading. We discuss
methods for using SAS visual tools to assess equity investment
strategies in back-tested simulations. Visualization tools in SAS and
JAVA were used to identify desirable surfaces on multidimensional
objects, which were then scrutinized more deeply in lower dimensional
objects to understand their behavior. This method offers an approach
to the diagnosis and prognosis of portfolio investment and management
strategies that can then be used to reduce costs, minimize volatility,
and so on. For this discussion, various strategies for investing in
members of the S&P500 Index were back-tested and reviewed for
desirable returns. Characterizing thousands of portfolio strategies in
sets of multidimensional objects can give analysts a broader
perspective on their investment strategies.
INTRODUCTION
Now that powerful analytic tools like computer clusters are affordable
and simple to maintain, previously unexplored ideas can be studied. In
finance, one of several disciplines traditionally associated with very
large data structures, these unexplored ideas might include different
investment vehicles, different investment strategies, different
amounts of history employed, or some combination of all of the above.
Even with such powerful analytic tools, approaches to identifying
successful strategies are not yet established. Some possible tools
might come from other data intensive disciplines such as genetics,
neuroscience, or number theory.
Organizing Principles
Several authors have published texts on the topic of the visual
display of information. The dialogue on this topic began for
quantitative scientists with the publication of Exploratory Data
Analysis by John Tukey (1970). This book may have been one of the
first compilations of many examples with instructions for “looking at
data to see what it seems to say”. In his preface, Tukey suggests
several principles of Exploratory Data Analysis that address:
“a basic problem about any body of data to make it more easily and
effectively handleable by minds…
• anything that makes a simpler description possible makes the
  description more easily handleable.
• anything that looks below the previously described surface makes
  the description more effective.
• to be able to say that we looked one layer deeper, and found
  nothing, is a definite step forward – though not as far as to be
  able to say that we looked deeper and found thus-and-such.
• to be able to say that “if we change our point of view in the
  following way… things are simpler” is always a gain…”
Tukey’s well-established principles can be used to construct
algorithms that help analysts represent, describe, and question their
strategies. In this way, each spatiotemporal event can be parsed into
component dimensions that are common (space and time) and unique
(market condition, within- or between-equity factors). Other
dimensions in high dimensional research described by Mazziotta (1996),
like surface rendering, can also be codified. Characterizing the
essence of a financial strategy requires mapping each relevant
dimension to a frame or object, with animation representing yet
another dimension. Keep in mind that other critical dimensions can be
represented in the visualization by using color, shape, size, and
maybe even sound.
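As a minimal sketch of how several such dimensions might be carried in
a single SAS/GRAPH display (the data set and variable names here are
hypothetical, not from the study itself), a 3-D scatter plot can
encode two additional dimensions through point color and size:

/* Three dimensions on the axes, two more in color and size.     */
/* STRATEGIES, REGIMECOLOR, and VOLUME are hypothetical names;   */
/* REGIMECOLOR holds a SAS color name for each market condition. */
proc g3d data=strategies;
   scatter volatility*return=holddays
           / color=regimecolor
             size=volume;
run;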
Research on Visual Analysis
Tukey (1977) is often cited as one of the first people to quantify and
qualify the visual analysis of data with his germinal edition on
Exploratory Data Analysis. The principles and procedures described in
that volume have become a requirement for the initial descriptive
analyses performed on almost any data set. His beginnings were further
refined with books that provide deeper, more detailed examples of
general graphical events (Chambers, Cleveland, Kleiner, and Tukey,
1983; Cleveland, 1985) as well as neurophysiologic graphical events
(Toga, 1990; Toga and Mazziotta, 1996). Tufte has produced a series of
books on graphical events that may be considered a beautiful finale to
this dialogue (Tufte, 1983; Tufte, 1990; Tufte, 1997). Each of Tufte’s
books provides the reader with examples of how different kinds of
information can be displayed, and in some cases, a set of clear
criteria for assessing the utility of a good display. Two levels of
dialogue critical to the visual analysis of large spatiotemporal
events are often missing from these books: (1) discussion of the
impact of a graphic on the researcher, and (2) discussion of
animation.
Several researchers have tried to measure the impact of a graphic on a
viewer empirically. In these experiments the viewers, usually
quantitative or behavioral science professionals, are shown plots of
data that represent various levels of effect size. From these
experiments two broad conclusions can be drawn. The first conclusion
is that visual analysis of data tends to be more conservative than
statistical analysis – viewers do not see effects that are
statistically significant (DeProspero and Cohen, 1979; Gibson and
Ottenbacher, 1988; Park, Marascuilo, and Gaylord-Ross, 1990) – thus
leading to higher probabilities of Type II error. The second
conclusion is that auto-correlation further obscures the recognition
of effects (serial dependency: Jones, Weinrott, and Vaught, 1978;
Wampold and Furlong, 1981) – viewers see fewer effects within and
between temporal events where the data are highly correlated, like
those found in finance and economics. Unfortunately these experiments
were all conducted with two-dimensional, still graphical
representations of data that show changes over time. There are no
similar experiments that assess the factors affecting the visual
analysis of data animations. Only a few researchers have addressed
animations and their use in data analysis (Palovcik et al., 1992; Toga
and Payne, 1991).
As the complexity of the models increases we may need to add another
dimension to the display. Animation can provide that added dimension.
Movements in space allow us to appreciate complex spatial
relationships because there are many more 3-D cues presented in a
given amount of view time. However, present-day algorithms and
hardware make this process difficult and expensive (Toga, 1990; page
269).
Evaluation of the Methods
In order to appreciate the more specific impact of the proposed data
visualizations on the analysis of investment, several issues will be
discussed. Initially, the barriers to investment research that were
enumerated earlier will be acknowledged along with solutions. This is
followed by a discussion of the errors in research (strategies,
groups, managers, entire firms). This section may in fact provide a
platform for assembling criteria for assessing any approach to the
data.
Barriers Overcome
By using the principles outlined by Tukey (1970) to organize data
visualization along the critical dimensions identified by Mazziotta
(1996), researchers in economics will gain a clearer understanding of
their data, how to approach their questions about the data, and how to
convey their findings.
By developing a set of downloadable computer programs, researchers
will be able to manage their directory structures and file naming
systems more efficiently. By developing and distributing computer
programs to read data, researchers will be able to analyze their data
more quickly. Finally, by describing the criteria for assessing
visualizations, researchers will be able to characterize and discuss
the use and impact of visual analysis on their research.
Error Reduction
Researchers need to be aware that they are always at risk of
committing errors in their research. The proposed visualizations
address errors that are of special concern to researchers in finance.
The Type I error rate, known as α, is the “p value” that many
researchers use to determine the likelihood that their finding may
have occurred by chance. These errors are the most embarrassing, and
are therefore kept below an industry standard of 5%. The Type II error
rate, known as β, represents the probability that a researcher will
fail to recognize an effect that truly exists in nature (Cohen, 1988).
Most people are familiar with Type II error as it is represented by
(1 – β), better known as “power” or the probability of correctly
identifying an effect. Most researchers try to keep power at about
0.80, which leaves them with a 20% chance of overlooking or missing
real effects. It is unlikely that Type III and IV errors are
calculable, but they do represent mistakes that can be identified at
some point in the analytic process. Type III error represents the
probability of applying an inappropriate statistical procedure to a
given set of data (Kaiser, 1960; Kimball, 1957; Mosteller, 1948),
while Type IV error represents the probability of misinterpreting a
statistically significant result (Keselman and Murray, 1974).
Traditionally, Type I error is thought to evolve in the analysis of
large spatiotemporal events during the statistical analysis of the
data. Testing investment strategies often involves statistical
inferences conducted between various combinations of conditions until
the “effect” evolves empirically. This process alone could inflate
Type I error rates as the same data are exposed to multiple
statistical tests. Keep in mind that alpha percent of statistical
tests result in the false rejection of the null hypothesis. There is a
tradition in some specialties to use statistical procedures that
involve intensive repeated testing of data along the lines of a
“moving t-test”, where the difference between two signals may be
tested through several hundred frames. Thus, one might expect quite a
few tests to be significant in any one experiment purely by chance.
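The scale of this inflation is easy to quantify: under the null
hypothesis, with k independent tests at alpha = 0.05, the expected
number of false rejections is 0.05k and the chance of at least one is
1 - 0.95 to the power k. A minimal sketch:

/* Expected false positives and familywise error across k tests */
data inflation;
   do k = 1, 10, 100, 500;
      expected_false = 0.05 * k;      /* significant by chance alone */
      p_at_least_one = 1 - 0.95**k;   /* familywise Type I error     */
      output;
   end;
run;

proc print data=inflation;
run;

At k = 500 frames, roughly 25 tests are expected to reach significance
by chance alone, and at least one false rejection is all but certain.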
This kind of Type I error can best be addressed by improved
statistical techniques, and it sheds light on a way that Type I error
could be evolving in visual analysis. In some respects, the act of
viewing data graphically is equivalent to the act of conducting a
statistical test. This proposition is most salient when reviewing the
empirical literature on visual analysis. Type I errors of this kind
are, in fact, very rare in visual analysis since experts are not
confident that two signals diverge until they are separated by 2-3
standard errors. In most statistical analyses, only 1 standard error
is necessary for two signals to be statistically distinct.
Type I errors could also occur by viewing the data from one source
intensively and then conducting inferential tests on that very same
source. This kind of Type I error could be overcome by using one
source to visually analyze a priori hypotheses and generate post hoc
hypotheses, while using data from other, unseen sources to
statistically analyze the a priori and post hoc hypotheses.
This proposal is especially concerned with Type II error as it evolves
in the field of finance. Such error can find its way into the testing
of a priori hypotheses in economic research when the effect is
obscured by noise. Traditionally, Type II errors are overcome by
maintaining adequate sample sizes and by identifying (and effectively
removing) sources of noise in a model. In finance, this kind of Type
II error is usually overcome by obtaining more data from an event or
process of interest (hundreds of times more in the case of intraday
data), and through the application of various signal filtration
devices and algorithms. Under a more liberal interpretation, a Type II
error could be literally “failing to see the effect”. It is this
interpretation of Type II error that may be most common in financial
research in the absence of widely available supercomputing systems.
From the a priori approach, an analyst may fail to see the effects
designed into the strategy itself, thus failing to reject the a priori
null hypothesis in the face of true experimental effects. From the
post hoc perspective, one might also ponder the hypotheses that might
never be entertained by an investment group simply because no human
has the brain capacity to store, recall, and compare the
spatiotemporal events represented by their data.
New hypotheses are very often implied or “called up” by the inspection
of existing data. In this light, one could even consider that some of
the most important questions in the field of neuroscience might never
be asked for lack of the adequate pre-hypothesis stimulation necessary
to invoke the question in researchers. This oversight, yet to be named
formally, could be thought of as Type V error in this proposal: the
probability of failing to formulate a hypothesis.
Data visualization techniques may be a very powerful way to address
Type II errors in the first case, where visual analysis overcomes the
obscuring effects of auto-correlation. Ultimately this question can
only be addressed empirically by back testing and forward testing. One
could argue that animating time “breaks up” the appearance of
auto-correlation. This may be especially true for many analysts who
treat time as a critical dimension to be manipulated just as any other
critical dimension. In this latter case, the representation of time
could be manipulated to emphasize changes between adjacent frames of
any size. Time could also be represented in such a way as to
essentially illustrate a “lag” of any size, so that multiple data
frames are presented to the viewer simultaneously.
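As a sketch of the latter idea (data set and variable names are
hypothetical), a DATA step can carry several lags of a return series
on each observation, so that a single display frame presents multiple
points in time simultaneously:

/* Carry the current and lagged returns on each row so a single */
/* animation frame can display several adjacent time frames.    */
data frames;
   set returns;              /* assumed: one row per trading day */
   ret_lag1  = lag1(ret);    /* previous day                     */
   ret_lag5  = lag5(ret);    /* one trading week back            */
   ret_lag21 = lag21(ret);   /* roughly one trading month back   */
run;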
Data visualization may be a powerful way to address Type II errors in
the second case, where visual analysis provides analysts with the
pre-hypothesis stimulation necessary to create new hypotheses (the so
called Type V error). This may, in fact, be impossible to demonstrate
empirically. The reader is encouraged to ask themselves “what new
questions might be entertained by analysts as a result of being
exposed to all of the data from their group in such a principled,
rapid, and accessible format?”
It is important to note that computerized visualization (graphic)
techniques have been used not only to visualize already known
phenomena but also, more interestingly, as participants in the process
of solving problems not yet completely answered (Toga and Mazziotta,
1996 [emphasis added]).
Type III errors occur most often where the nature of the data and the
underlying processes responsible for the data do not match the
assumptions of the statistical model. Simply knowing the market, or
knowing the strategy model, is not enough to prevent Type III errors.
Having knowledge of both is essential, and what better way to quickly
and clearly inform an analyst or collaborating manager of the
complexity of the data and its origins than to show different
animations of the strategy under study. In all likelihood, the visual
analysis of data will reduce confusion about the nature of the overall
approach and the specific strategy, while at the same time
demonstrating the complexity of the data along the critical axes of
market, time, and strategy.
Finally, Type IV errors evolve after the analysis, when analysts
misinterpret significant statistical findings. Once again, it can be
stated that the visual analysis of data reduces confusion about the
nature of the strategies being simulated in such a way as to reduce
the chances of misinterpreting a significant finding. Most of the
literature on Type IV errors focuses on the misinterpretation of
significant interaction terms. The visual analysis of economic data in
animations or high-dimensional still frames can only help analysts to
understand the meaning of the interaction. In fact, certain
interactions can be designated as critical dimensions in and of
themselves. This way, some animations can be constructed to emphasize
the (non)existence of the interactions most relevant to the process
under study. In this application, visual analysis represents a leap
ahead of the traditional method of interpreting interactions in a
large, multidimensional data space: looking at F tests, waving hands,
drawing diagrams, and comparing statistics across cells/factors.
DEMONSTRATION
For this discussion, various strategies for investing in members of
the S&P500 were back-tested and reviewed for desirable returns. Those
strategies reaching criteria were further studied and optimized with
secondary strategies to reduce cost, reduce volatility, and finally,
to reduce risk. In each step, SAS visual tools were used to identify
strategies of interest. While the immediate goal of this discussion
was to perfect the data acquisition and data animation systems in SAS,
the reader should note that JAVA is also used out of convenience.
Data Sources
Pilot data for this paper were derived from the S&P500 Index.
Membership history was taken from published records of SP500
membership, and equity histories were purchased from suppliers of
closing equity data (TC2000, etc.). Overall, the portfolio averages
150-200 equities when full, with an average holding period of less
than 1 month. Costs were $0.01 per share each way and $10 per trade
each way. Strategies were traded with the goal of keeping data from
1981 to 2000 "out of sample". There was no cost process estimated for
transitioning in and out of strategies.
Figure 1. Percent of equities identified as members of the Index in
blue, and the percent of equity histories acquired.
Overall, there were 31 stock-years represented in the chart, and we
are missing a cumulative total of 6 stock-years, primarily from the
period of 1970 to 1980. Only 3.5% of all stocks identified have
absolutely no price history. Data from the close prices were used to
simulate trades for a number of strategies.
Visualizing Time
In order to address the basic needs of the investment analysts, a
rapid data management, analysis, and animation system was drafted to
back test daily returns from 1970 to present and from 1990 to present,
depending upon the focus of the modeling. Initial strategies were
based exclusively on S&P500 Index members (DayTrends, 2000) contrasted
with Index performance and weighting.
Figure 2. Ten year back test of log $1 return of two blended
strategies (blue) and the SP500 Index (red) against time.
The code necessary to generate Figure 2, once the portfolio investment
has been fully simulated, uses ODS output options designed to produce
GIF images that are most simply embedded in HTML objects.
/* Route ODS HTML output (with GIF images) to the project directory */
filename odsout "c:\SP500S\DOCS\";
ods html path=odsout body="&SELECT..HTML" nogtitle;

/* Graphics environment: GIF device, white text on black background */
goptions reset=global gunit=pct
         colors=(red blue green)
         ftext=swiss ftitle=swissb htitle=6 htext=3
         device=gif transparency noborder
         ctext=white cback=black;

/* Joined lines for the strategy (blue) and the index (red) */
symbol1 i=join v=point color=blue;
symbol2 i=join v=point color=red;

axis1 width=2.0 label=("Time");
axis2 width=2.0 label=("Log $1 Invested") order=(-1 to 4 by 1);

legend1 label=none mode=protect
        value=(color=blue "DayTrends" color=red "SP500");

/* Overlay the simulated strategy and index return series;  */
/* uses the most recently created data set of daily returns */
proc gplot;
   title2 "&SELECT from 1990-2000";
   plot (dtllc sp500)*time / overlay
                             haxis=axis1 vaxis=axis2
                             legend=legend1
                             vref=0 1 2 3;
run;
quit;

options nobyline;
ods html close;
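Note that the HTML file name and the TITLE2 text are driven by the
macro variable &SELECT, which is assumed to be assigned before the
graphics code runs; for example (the strategy label is hypothetical):

%let select = BLEND01;   /* hypothetical strategy label used in the */
                         /* ODS body= file name and in the title    */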
File Seeking
Portfolio trading was simulated on a cluster of workstations (COW)
managed by SAS, and performance statistics were gathered from across
the cluster using an approach to network file seeking. In essence,
this approach to file management amounts to a series of X commands
designed to output sequential DIR listings to text files that are
subsequently read by SAS as data files, until there are no deeper
subdirectories. Each DIR output of files and directories is appended
to a database of the files, directories, drives, and machines on the
network cluster, which can then be assembled with a series of massive
DATA/SET statements.
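A compressed sketch of one pass of this approach, with hypothetical
paths and data set names (DIR /S recurses in a single command here,
whereas the system described above iterates level by level):

/* Write a directory listing to a text file with an X command, */
/* then read the listing back in as a SAS data set.            */
options noxwait;
x 'dir c:\SP500S /b /s > c:\SP500S\filelist.txt';

data filelist;
   infile 'c:\SP500S\filelist.txt' truncover;
   input path $200.;
run;

/* Append the new listing to the running database of files */
/* (assumes ALLFILES already exists from a previous pass)  */
data allfiles;
   set allfiles filelist;
run;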
Visualizing Space
Once enough strategies have been simulated, they can be summarized in
terms of various performance statistics and then most easily
visualized using Interactive Data Analysis.
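For example, annualized performance statistics can be computed from
the simulated daily returns before plotting; a minimal sketch with
hypothetical data set and variable names:

/* Summarize daily returns into annualized statistics per strategy */
proc means data=daily_returns noprint nway;
   class strategy_id;
   var ret;
   output out=perf mean=mean_ret std=std_ret;
run;

data perf;
   set perf;
   ann_return     = mean_ret * 252;        /* ~252 trading days/year */
   ann_volatility = std_ret * sqrt(252);
run;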
Figure 3. 3D object of Volatility x Return x Hold Period (X,Y,Z) for
81,000 portfolios back-tested for 10 years, from 1990 to 2000. While
the index performed near the minima of the X,Y axes (0.14 annual
volatility and 0.15 annual return), the 3D object reveals many
strategies that deliver considerably higher returns per unit of risk.
Spinning this object reveals several surfaces that can be thought of
as representing performance limits in these non-optimized strategies.
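One way such a spinning view might be produced (data set and variable
names hypothetical, continuing the PERF sketch above) is with the
ROTATE= option of PROC G3D, which draws one plot per angle; with
DEVICE=GIF, each view becomes a frame that a viewer can play back as
an animation:

/* Twelve views around the vertical axis, one GIF frame per angle */
proc g3d data=perf;
   scatter ann_volatility*ann_return=hold_period
           / rotate=0 to 330 by 30;
run;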
At this point in the process, identifying a family of strategies
becomes necessary for performing the next steps of model optimization.
This can be done by viewing the selected returns in a series of
animation frames in which more specific decision criteria are
systematically manipulated. A JAVA-based GIF frame viewer will be
demonstrated.
To further study the relationships among different strategies, we
performed virtual surgery by constructing virtual objects (usually
spheres) in the (x,y,z) coordinate system and placing each object's
center so that the outer boundary of the sphere fell along the
relevant surfaces. The outer layer is then "peeled back" by adding
(x,y,z) constants to the data points selected by the spherical
criterion. The final product can be viewed in Interactive Data
Analysis, where a new variable is constructed to produce additional
X,Y,Z distance from the center, or an animation can be output with
PROC G3D. The following DATA step fragment applies the spherical
criterion:

/* Squared distance of each point from a sphere centered at (0,-2,3) */
sphere = x**2 + (y + 2)**2 + (z - 3)**2;

/* Displace the points that fall within the spherical boundary */
if sphere < 15 then yc = -y + 6;

This new animation revealed previously unseen surfaces that were never
part of the original study design or strategy simulation.
Figure 4. Example of surface peeling using spherical criteria.
Influence
Once surfaces and sub-surfaces are obtained, the objects of interest
can be reconstructed in two dimensions with color, shape, or size
representing further diagnostically relevant strategies.
Figure 5. Return by Volatility among 81,000 portfolios from back tests
of 1990 to 2000, where each color represents similar decision criteria
(SP500 Index return and volatility are represented as reference lines
at 0.14 annual volatility and 0.15 annual return).
CONCLUSION
Characterizing thousands of portfolio strategies in sets of
multidimensional objects can give analysts and money managers a
broader perspective on their investment strategies. In many cases,
there may be investment products underlying currently employed
strategies that would be of interest to new clients.
CONTACT INFORMATION
Haftan Eckholdt
DayTrends, LLC
10 Jay Street, suite 504
Brooklyn, New York 11201
Work Phone: (917) 548-3178
Fax: (718) 522-3170
Email: [email protected]