Download extracting formations from long financial time series using data mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

K-means clustering wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Transcript
BRUSSELS ECONOMIC REVIEW – CAHIERS ECONOMIQUES DE BRUXELLES
VOL. 53 N° 2 SUMMER 2010
EXTRACTING FORMATIONS FROM LONG
FINANCIAL TIME SERIES USING DATA MINING
STELLA KARAGIANNI1, THANASIS SFETSOS2
3
AND COSTAS SIRIOPOULOS
ABSTRACT:
Technical analysis has become a custom decision support tool for traders and analysts, though
not widely accepted by the academic community. It is based on the identification of a series
of well-defined formations appearing over irregular intervals. The same principle forms the
basis for the application of data mining methodologies as a tool to discover hidden patterns
that exist in a time series, which is achieved by a detailed breakdown of historic information.
This paper introduces a methodology for the discovery of formations that exist within a time
series and have high probability of reoccurrence. The methodology was developed in an
efficient manner requiring only a small number of user-specified parameters. Its two main
stages are (a) a modified bottom-up segmentation algorithm with an optimization stage to
reach the optimal number of segments, and (b) a rule extraction algorithm. The developed
methodology is tested on two major financial series, the daily closing values of the SP500
Index and the GB Pound to US Dollar exchange rates.
JEL-CLASSIFICATION: C22, C63, F31.
KEYWORDS: technical analysis, data mining, exchange rates, stock market, pattern
recognition, rule extraction.
1
2
3
Department of Economics, University of Macedonia, 156, Egnatia st., 54006 Thessaloniki, Greece.
EREL, INTRP, NCSR Demokritos, 15310 Aghia Paraskevi, Greece. Email: [email protected]
Dept. of Business Administration, University of Patras, 26500, Patra, Greece. Email:
[email protected]
273
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
INTRODUCTION
Recent empirical studies in the international literature have documented substantial
evidence on the predictability of asset returns and foreign exchange markets using
technical analysis trading rules (Brock et al, 1992; Levich & Thomas, 1993; Chan
et al, 1996; Bessembinder & Chan, 1998; Allen & Karjalainem, 1999; Lo et al,
2000; Fang & Xu, 2003). These studies report that asset returns are correlated, and
hence, some degree of predictability can be captured, by technical trading rules, by
time series models or their combination. In this work a data-driven methodology is
developed to discover formations, of non-constant length, that appear in a
univariate financial data set. The methodology was developed to require only a
limited number of parameters from a potential future user.
Technical analysis is defined as the attempt to identify regularities in the time series
of price and volume information from a financial market, by extracting patterns
from noisy data (Lo, 2000). Analysts use price charts to study market action as a
means to predict future price trends. They believe that prices move in trends, and
that history (price patterns) repeats itself (Murphy, 1986). Price patterns are pictures
or formations, that can be classified into different categories, and have predictive
value. Although there are potentially an infinite number of price formations, the
technical analysis literature (Bulkowski, 2000) suggests that certain price
formations are widely identified in stock and foreign exchange markets.
A step beyond technical analysis is the application of data mining to identify
previously unknown formations in a time series. This technique also provides the
opportunity to generate association rules that linguistically describe the examined
process. Weiss and Indurkhya (1998) define data mining as “the search for valuable
information in large volumes of data. Predictive data mining is a search for very
strong patterns in big data sets that can generalize to take accurate future
decisions”. There are two main types of data mining applications in finance. The
first is that some predefined structure of the series exists and an indexing approach
aims to identify similar behaviour within the dataset. The alternative is to apply
theses methodologies to discover previously unknown patterns within the series and
examine reoccurrence characteristics for prediction purposes.
In the literature there are several techniques developed to analyse financial data and
extract information about their future behaviour. Zeng et al (2001) proposed a
hybrid approach for the prediction of stock market trend using pattern recognition
and classification. A set of vectors describing future patterns were introduced which
were classified based with a probabilistic relaxation algorithm. The approach was
tested on the Air-Quantas stock price and a 68% success was reported. Leigh (2002)
introduced a method for template matching to predict future trends of a series. The
nature of the produced results resembled those produced from technical analysis.
On the NYSE composite index he reported similar success in the prediction of a 5day pattern.
Das et al (1998) proposed an approach for the discovery of rules from sequential
observations, using discrete representation. The rules included local relationships in
the series using a finite number of primitive shapes, initially found from a greedy
274
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
clustering algorithm. Experimental results were presented for ten companies of the
NASDAQ stock market for a period of 19 months. Lu et al. (1998) studied stock
series for extracting inter-transaction association rules. They defined events on the
time scale and tried to extract relations between similar patterns. The algorithm was
limited to a fixed "sliding window" time interval. Only associations among the
events within the same window were extracted. A demonstration of a single
dimensional database of a stock market was given, but without prediction results.
Last et al (2001) introduced a general methodology for knowledge discovery in
time series databases. The process included cleaning of the data, identifying the
most important predicting attributes, and extracting a set of association rules to
predict the series future behaviour. The computational theory of perception was
used to reduce the set of extracted rules by fuzzification and aggregation. Povinelli
and Feng (2003) proposed a method that identified predictive temporal structures in
reconstructed phase spaces. A genetic algorithm was employed to search such phase
spaces for optimal heterogeneous clusters that are predictive of the desired events.
Their approach was able to classify time series events more effectively compared to
Time Delay Neural Networks and the C4.5 algorithm.
This works introduces a two-stages methodology to identify formations of nonconstant length appearing within a financial time series. Initially, the series is
represented in linear segments using a modified bottom-up approach. Furthermore,
one of the most vague issues in similar studies, the determination of the optimal
number of segments is addressed. Then, a rule extraction algorithm is applied to
estimate those series formations with a higher probability of reoccurrence. The
proposed approach is applied on two major financial series, the daily closing values
of the SP500 Index and the GB Pound to US Dollar exchange rate.
I. METHODOLOGY
In the literature there are a number of authors discussing the issue of similarity for
time series segments. The problem is defined in Hetland (2004) as: “For a sequence
q, find the nearest ones from a set x. These are defined as those whose distance,
using any selected measure, from the original is below a predefined tolerance
threshold.” The problem gets further complicated when the original segments are
not previously known and the examined segments are of non-constant length.
The target of the developed methodology is to identify patterns or formations, of a
time series that have high probability of reoccurrence, without any previous
knowledge about their characteristics. It is designed to produce fast and easily
interpretable results with only a small number of parameters that need to be
selected. Furthermore, as will be discussed in the following paragraphs, all parts of
the methodology present an improvement over previous research (Kim et al, 2000;
Park et al, 2000; Caraca-Valente & Lopez-Chavarrias, 2000; Last et al, 2001; Lin et
al, 2002; Povinelli & Feng, 2003; Sarker et al, 2003). In this work the developed
methodology is tested to financial data, although its implementation to type of data
is straightforward. The various stages of the process are:
 Data Cleaning. A suitably selected low pass filter is used to reduce the amount
of noise present in the series.
275
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
 Piecewise Linear Segmentation. The bottom up segmentation algorithm is
applied with the modification of joining nearest segments if they have similar
trends. An optimization phase is included to identify the optimum number of
segments.
 Symbolic Representation. The resulting segments of the series are then
classified into a finite alphabet.
 Rule Extraction. Finally, the sequential characteristics of the alphabet are
extracted in the form of symbolic rules.
A. DATA CLEANING
Initially, the original time series was smoothed using a low-pass digital FIR filter,
(1). The signal contains lower amount of noise which brings to light characteristics
such as trend. The coefficients, ακ and bκ, are estimated so that the resulting signal
has the desired behaviour in terms of frequency response. The low-pass filter cutoff frequency was determined using Welch’s (1967) power spectral density
algorithm.
N
M
k 0
k 1
yt   ak xt k   bk yt k
(1)
B. LINEAR SEGMENTATION WITH MATCHING TRENDS
The epicentre of all subsequent matching approaches is the efficient representation
of the time series, which substantially speeds up the rule extraction algorithm.
Argawal (1993) gathered information about the characteristics of a series using the
F-index methodology, later evolved by Faloutsos (1994) into the ST-index to
account for subsequence matching. Yi and Faloutsos (2000) and Keogh et al (2001)
derived independently a piecewise constant approximation approach to divide each
sequence into equally spaced segments, using the average value of each as a
signature.
Keogh and Smyth (1997) applied a probabilistic landmark-based technique, where
the extracted features are characteristic parts of the sequence. Perng et al (2000)
applied a similar approach that identifies points of “great importance” as landmark
features. The specific form used defined landmarks as points where the
approximation function’s nth derivative is zero. Kim et al (2001) used as landmarks
the boundaries of the segments as well as minimum and maximum points.
Many researchers attempt to capture the main characteristics of the data using
piecewise linear approximation. There are three main algorithms applied for this
task: sliding windows, top-down and bottom-up, which are combined with two
alternatives approximations: interpolation and regression. For an excellent review
and performance comparison of these approaches the reader is directed to the work
of Keogh at al (2004). The authors also proposed the SWAB algorithm, a
combination of sliding windows and bottom-up, which was empirically shown to be
superior to any other examined.
276
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
The proposed algorithm is a modified bottom up algorithm with a linear
interpolation between end-points, also including an extra phase that forces the
merging of nearest segments if their trends are similar, (Figure 1). Two trends are
presumed similar if their difference is below a predefined threshold. Furthermore,
we address one of the most vague issues in similar studies, the selection of an
optimum number of segments. Thus, the only parameters that need to be selected
are the minimum number of data contained in each segment and the trend similarity
threshold.
This algorithm requires the use of a cost function. It assists in the process of
selecting the appropriate segments to merge, and is the critical measure for the
selection of the optimal number of segments. The selected cost function in this
work is the squared sum of the Euclidean distance of all members of a segment
from the linear interpolant between the end points, i.
 k

In the case of merging segments e    dist ( y j  I k ) 


 j 1

In case of total series cost
E




2

dist ( y j  I k ) 

j 1

k
(2a)
2
 
all
segments
(2b)
FIGURE 1. BOTTOM-UP ALGORITHM WITH MATCHING TRENDS
Define: minimum segment length, dmin and trend matching threshold εt.
Step 1
Split series into segments with length dmin,
Assign any remain data to the first or last segment
Merge nearest segments if their trend is similar
Store number of total segments n_seg(1) , and total cost of series E(1)
Step 2
Loop all segments
Estimate the merging cost of two nearby segments, e
Find segments with minimum e and merge
Step 3
Estimate trend of new segment and compare with neighboring trends
If trends similar then merge
Store total cost of series E(k) and number segments n_seg(k)
Step 4
Go to Step 2, unless a predefined number of segments are reached
Step 5
Selection of the optimal number of segments
277
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
The looping stage of the algorithm stops when the total number of segments reaches
below 10% of the number segments estimated in Step 1 (n_seg(1)). This was set to
speed up the algorithm, since it was empirically found that financial series require a
relatively high number of segments due to their complex characteristics. The
optimal number of segments is found from a trade-off the number of segments and
the total cost of the series, using their combinatory expression. These two indices
are scaled to vary between [0,1], which ensures the validity of the scheme to any
sort of data. The finally selected number is the one that minimizes (3).
max(n _ seg )  n _ seg
max(n _ seg )  min(n _ seg )
max( E )  E
E_n 
max( E )  min( E )
Comb  n  E _ n
n
(3)
C. SYMBOLIC REPRESENTATION
The finally determined segments, s, are then classified into a finite number of
distinct classes or alphabet, Cs, which facilitates the process of rule extraction.
There are two main procedures for achieving this, heuristically and clustering (Das
1998). The size of alphabet should be selected taking into consideration the total
number of series segments. In this work, the alphabet size was directly related to the
financial nature of the examined series. Heuristically, five classes were chosen to
represent different market states. Additional information included was the duration
of the segments, classified into three types (Table 1).
TABLE 1. ALPHABET SELECTION
Series Values
Segment Duration
Class
Symbol
Market
Class
Symbol
Very Negative
VN
Bearish
Short, DR ≤ 5 days
S
Negative
N
Decrease
Biweekly, 5 < DR ≤ 10 days
B
Constant
C
Stable
Long, DR > 10 days
L
Positive
P
Increase
Very Positive
VP
Bullish
D. RULE EXTRACTION
A rule extraction algorithm is derived to estimate association rules that describe
sequential characteristics of the estimated classes, as selected in the previous step.
The association rules are of the generalised form.
278
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
C(s-j) … C(s-2) C(s-1)  C(s) … C(s+k)
(5)
The antecedent part or Left Hand Side (LHS) contains historic information about
the classes, C, ordered in a sequential manner. The consequent part or Right Hand
Side (RHS) contains future information of the series. Two measures are used to
determine the strength of the rule, using information contained in the data set:

Support is the number of instances that the rule appears within the data set.

Confidence is the accuracy with which the rule predicts correctly, i.e. that the
LHS leads to the RHS.
These measures, which are user-defined parameters, are a measure of the frequency
that a formation appears in the data set and its predictability. The total number of
possible rules is the number of different classes to the power of the combined LHS
and RHS dimensions.
FIGURE 2. RULE EXTRACTION ALGORITHM
Define: minimum support and confidence levels
Step 1
Select LHS as C(s-1)
Step 2
Set flag (for all segments) = true
Loop all segments (s)
Find first true flag and add as a new rule
Find remaining data following specified rule
Set flag(data) = false
Step 3
Remove rules < support threshold
Remove rules < confidence threshold
Save final set of rules
Step 4
If no rules are found
End algorithm
Else
Add previous class to LHS
Go to Step 2
Previous attempts of solving the problem of mining predictive rules from time
series can be classified into two main types. Supervised methods, where the targetrule form is known in advance and used as input to the rule extraction algorithm.
Thus the objective of the analysis becomes to generate rules for predicting these
events based on the data available before the event occurred (Weiss, 1998, Hetland,
2002). In the second type, unsupervised methods, the inputs to the algorithms are
only the data. The goal is to automatically extract informative rules from the series.
In most cases this means that the rules should have some level of preciseness, be
representative of the data, easy to interpret, and interesting to the financial analyst
(Freitas, 2002). The most celebrated approach in the literature (e.g. Agrawal, 1995;
Mannila,,1997; Keogh, 2002) is to sequentially scan all the available data and add
up every occurrence of a rule, and of every antecedent and consequent part. This
counting makes it possible to calculate the frequency of appearance and confidence
279
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
of each rule, in the cost of limiting the rule format. The developed rule extraction
algorithm is based on a sequential database scan, but the search in each loop is done
with decreasing number of samples. The entire process is iterated with an
increasing adjacent part until the support of all estimated rules are below a
predefined threshold value. The finally presented rules are those above a specified
support and confidence threshold value. These values are selected to extract only
frequent patterns with high probability that the consequent part will appear in the
future, enabling profitable trading decisions.
FIGURE 3. (A) DETAIL OF SEGMENTED SP500 SERIES AND (B) THE OPTIMIZATION
PHASE
1100
Index Value
1000
900
800
700
600
4500
4550
4600
4650
Days
1
Scaled Values
0.8
Combined
0.6
Number of Segments
0.4
0.2
0
0
280
Total Cost
500
1000
1500
2000
Iterations
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
2. APPLICATION TO THE SP500 DAILY INDEX
The first series that the methodology was applied to is the daily closing values of
the SP500 index. The data covered a period of 19 years including only working
days, from the first trading day of 1985 to January 27, 2004. Initially, a low-pass
filter was applied with a cut-off frequency value of 0.1 (1/day), which is
approximately a period of two working days. The minimum segment length, dmin,
was set to three, including end-points, and the trend-matching threshold, εt, to
0.3%. A detail of the resulting segmented series is presented in Figure 3a, alongside
the graph showing the scaled values of the total cost, E, and the number of
segments (3b)
The derived segments were characterised by their trend defined as the percentage
x
 x start
change between the start and end points as: tr  100 * end
. The trends of
x start
each segment were classified into five different classes, Table 2, as
TABLE 2. TREND CLASSIFICATION FOR THE SP500 SERIES
VN ≤ -4%
-4% < N ≤ -1%
-1% < C ≤1%
1% < P ≤ 4%
VP > 4%
These values were selected so that the classes contain roughly the same number of
data. The threshold values for the rule extraction algorithm were set as 5 values for
support and 70% for confidence. The later ensured that the extracted rules would
lead to profitable decisions. Two different types of rules were extracted. Initially,
the antecedent part contained only the trend type class, whilst in the second it
additionally contained the duration class of the segment. The results from both
cases are summarized in Tables 3 and 4.
TABLE 3. SP500 SERIES RULES USING TREND
ID
R1
R2
R3
R4
Rule
TR(s-4): VP & TR(s-3): N & TR(s-2): P & TR(s-1): C
 TR(s): P
TR(s-4): C & TR(s-3): VN & TR(s-2): P & TR(s-1): N
 TR(s): P
TR(s-5): N & TR(s-4): P & TR(s-3): N & TR(s-2): P
& TR(s-1): P  TR(s): C
TR(s-5): P & TR(s-4): N & TR(s-3): P & TR(s-2): P
& TR(s-1): C  TR(s): N
Support
6
Confidence
100%
5
71.4%
5
83.3%
7
77.7%
The first rule (R1) is translated, as segments with strong increase followed by
relatively milder decrease and increase respectively and then a period of stable
market behaviour will be followed by increasing values. This rule appeared
281
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
collectively six times in the examined series, all in periods with upward trend, and
all of them resulted in the same consequent characteristics. Rule 2 shows a stable
market followed by strong decrease is and then by a succession of positive and
negative behaviour. This rule mostly appears as the first half of a peak starting
cyclical behaviour. Rule 3 shows a succession of segments with negative and
positive trends ending in stable market behaviour when two positive periods, with
different trend, appear consecutively. Rule 4 is partly the continuation of the
previous rule exhibiting a decrease in the future value of the trend, but also showed
up independently in more than half of appearances.
FIGURE 4. EXAMPLES OF R1-R4 IN THE SP500 SERIES
860
840
Index Value
820
800
780
760
740
3040
3060
3080
Days
3100
3120
3140
1160
1140
1120
Index Value
1100
1080
1060
1040
1020
1000
980
4300
4350
4400
Days
282
4450
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
600
Index Value
590
580
570
560
550
540
2670
2680
2690
Days
2700
1240
Days
1245
2710
2720
350
Index Value
345
340
335
330
1220
1225
1230
1235
1250
1255
1260
TABLE 4. SP500 SERIES RULES USING TREND AND DURATION
ID
R5
R6
R7
R8
R9
Rule
seg(s-2): VP_B & seg(s-1): N_S  seg(s): P
seg (s-2): N_B & seg (s-1): P_B  seg (s): N
seg (s-3): N_B & seg (s-2): P_B & seg (s-1): N_S
 seg (s): P
seg (s-3): VN_S & seg (s-2): C_S & seg (s-1): P_S
 seg (s): N
seg (s-3): C_S & seg (s-3): N_S & seg (s-2): C_S
& seg (s-1): P_S  seg (s): N
Support
10
6
5
Confidence
71.4%
75%
71.4%
5
71.4%
5
71.4%
In the next stage, the analysis included the duration of the antecedent part as
additional information, in attempt to gather more information the behaviour and
transitional characteristics of the series. Rule five (R5) shows that biweekly periods
with strong increase that are followed by short periods of decrease are more likely
to results in increasing future behaviour. This rules is associated mainly with bullish
market periods. The next rule, R6, is a succession of negative and positive periods
lasting biweekly periods, which leads to a decrease. Rule 7 represents periods
where biweekly negative and positive periods followed by short decrease will
mostly be followed by increasing behaviour. The last two rules appear on the first
12 years the examined series. Rule 8, R8, shows a succession of bearish, stable and
positive market characteristics, each lasting less than a week, is followed by
negative trend. On the majority of the appearances of this rule the antecedent part
forms a complete cycle, and the consequent the start of a second one.
283
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
A short decrease is followed by biweekly stable periods are more likely to be
followed by an increase in the series price. The antecedent part of R9 is a full cycle
last more than three weeks in total, which is followed by the start of a new cycle.
FIGURE 6. EXAMPLES FROM R5, R6, R7 AND R8
1130
1120
1110
Index Value
1100
1090
1080
1070
1060
1050
1040
1030
3330
3335
3340
Days
3345
3350
465
Index Value
460
455
450
445
2180
284
2185
2190
2195
Days
2200
2205
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
422
420
Index Value
418
416
414
412
410
408
406
1845
1850
1855
1860
Days
1865
1870
1875
1340
Index Value
1320
1300
1280
1260
1240
3732
3734
3736
3738
3740 3742
Days
3744
3746
3748
3. APPLICATION TO THE DAILY POUND/DOLLAR EXCHANGE RATE
The developed methodology is applied to the daily GB pound to US dollar
exchange rate series. The data covered a period of approximately 32 years from the
4th January 1971 to December 23, 2003. The exchange rates series is influenced by
a number of factors, due to its significance as a policy tool for dealing with
macroeconomic issues. Furthermore, there are many ways that central banks
intervene in the foreign exchange, which induces additional features in the series
(Taylor). Initially, a low-pass filter was applied with a cut-off frequency value of
0.07 (1/day), which is approximately a period of two weeks. The minimum segment
length, dmin, was set to five and the trend-matching threshold, εt, to 0.25%. The
optimization stage returns 763 segments (Figure 7).
285
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
FIGURE 7. FOREIGN EXCHANGE SERIES AND ITS SEGMENTATION
2.8
2.6
2.4
2.2
2
1.8
1.6
1.4
1.2
1
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
The derived segments were characterised by their trend again defined in percentage
terms. A five letters alphabet was chosen, Table 5. For the trend classification, the
three classes described in Table 1 were selected.
TABLE 5. TREND CLASSIFICATION FOR THE GBP/USD SERIES
VN ≤ -2%
-2% < N ≤ -0.5%
-0.5% < C ≤0.5%
0.5% < P ≤ 2%
VP > 2%
The threshold values for the rule extraction algorithm were set as 3 values for the
support and 70% for the confidence parameters. Again the same two types of rules
were extracted, with the antecedent part contained trend type class and the duration
of the segments. Results from both cases are summarized in Tables 6 and 7.
TABLE 6. GBP/USD RULES USING SEGMENT TREND
ID
R1
R2
R3
R4
R5
286
Rule
TR(s-4): VN & TR(s-3): N & TR(s-2): P & TR(s-1): VN
 TR(s): VN
TR(s-4): P & TR(s-3): N & TR(s-2): VN & TR(s-1): P
 TR(s): VN
TR(s-4): VN & TR(s-3): P & TR(s-2): VN & TR(s-1): P
 TR(s): VN
TR(s-4): C & TR(s-3): VP & TR(s-2): C & TR(s-1): N
 TR(s): P
TR(s-5): P & TR(s-4): N & TR(s-3): P & TR(s-2): N
& TR(s-1): VN  TR(s): C
Support
3
Confidence
100%
3
75%
3
75%
3
75%
3
75%
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
Rules 1 to 3 represent a periods of decreasing trend, ending in very negative market
characteristics (Figure 8). Rules 1 appears at periods with strong increasing trends,
whilst R2 the latter mainly appears as the falling part of cycles lasting
approximately three months. R3 appears at non well-defined periods of the data.
Rule 4 shows a general upward trend with stable and mildly decreasing
characteristics. Rule 5 shows a succession of positive and negative periods ending
in low fluctuating markets.
FIGURE 8. EXAMPLES FROM R1, R2 AND R3
2.38
2.36
2.34
2.32
2.3
2.28
2.26
2.24
2.22
2.2
2.18
750
760
770
780
790
800
1.96
1.94
1.92
1.9
1.88
1.86
1.84
1.82
1840
1850
1860
1870
1880
1890
1900
1910
1.58
1.56
1.54
1.52
1.5
1.48
1.46
3140
3150
3160
3170
3180
3190
287
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
In the next stage, the analysis included the duration of the antecedent part as
additional information, in attempt to gather more information the behaviour and
transitional characteristics of the series (Table 7). A set of three rules (R6-R8)
shows periods ending in strong increasing characteristics (Figure 9). R6 shows the
formation of a cycle that starts with long periods of negative and stable
characteristics, followed by sharp increase. R7 captures the duration of a whole
cycle whose antecedent part is approximately three weeks. Rule 8 shows strong
increasing behaviour interrupted by a very short decreasing period. Another set of
rules with bigger LHS is associated with decreasing consequent behaviour (Figure
10). Rule 9 represents the behaviour of the series period after a cycle peak. Rule 10
is a succession of decreasing and increasing periods followed by decreasing
behaviour. This rule appears at irregular instances in the period. Finally, R11 is a
succession of biweekly trends that appears mostly on periods of low fluctuations.
TABLE 7. GBP/USD EXTRACTED RULES USING TREND AND DURATION
INFORMATION
ID
R6
R7
R8
R9
R10
R11
288
Rule
seg (t-2): N_L & seg (t-1): C_L  seg (t): VP
seg (t-2): VN_S & seg (t-1): C_B  seg (t): VP
seg (t-2): VP_L & seg (t-1): N_S  seg (t): VP
seg (t-3): P_L & seg (t-2): VP_B & seg (t-1): C_B  seg (t): N
seg (t-3): N_B & seg (t-2): P_B & seg (t-1): N_L  seg (t): VN
seg (t-3): P_B & seg (t-2): N_B & seg (t-1): P_B  seg (t): N
Support
3
3
3
3
5
3
Confidence
75%
100%
75%
75%
71.4%
100%
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
FIGURE 9. EXAMPLES OF R6, R7 AND R8
1.65
1.64
1.63
1.62
1.61
1.6
1.59
1.58
1.57
1.56
1.55
6700
6705 6710
6715 6720
6725 6730 6735 6740
1.67
1.66
1.65
1.64
1.63
1.62
1.61
1.6
1.59
3094 3096 3098 3100 3102 3104 3106 3108 3110 3112
1.54
1.52
1.5
1.48
1.46
1.44
4165
4170
4175
4180
4185
4190
4195
4200
289
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
FIGURE 10. EXAMPLES FROM R9, R10, R11
2.56
2.54
2.52
2.5
2.48
2.46
570
580
590
600
610
620
1.9
1.88
1.86
1.84
1.82
1.8
4510 4515 4520 4525 4530 4535 4540 4545 4550 4555
1.64
1.63
1.62
1.61
1.6
1.59
1.58
1.57
1.56
6315
6320
6325
6330
6335
6340
6345
6350
4. CONCLUSIONS
In this paper we developed a methodology for the detection and extraction of
similar subsequences of the data that appear in a time series. The methodology is
designed to produce easily interpretable results with only a small number of
parameters that needed to be defined. The main stages of the algorithm are the
series representation using a bottom up segmentation algorithm extended to account
for matching joining segments with similar trends, and a modified sequential rule
extraction algorithm. Furthermore, it addresses one of the most overlooked issues in
similar studies, that is the optimal selection of the number of segments.
290
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
The approach is tested on two financial series the daily closing values of the SP500
index and the GB Pound to US Dollar exchange rate. Adding to trend type
representation of the series, the duration of the trend was included. Setting a
confidence threshold parameter as 70%, three and five rules were identified for the
two series respectively using only trend type information. The number of rules
increased to five and six respectively when duration was used.
REFERENCES
Alexander, S.T., 1986. Adaptive signal processing: Theory and applications,
Berlin: Springer Verlag.
Allen, F. and R. Karjalainen, 1999. Using generic algorithms to find technical
trading rules, Journal of Financial Economics, 51, 245-271.
Agrawal, R., C. Faloutsos and A.N. Swami, 1993. Efficient Similarity Search In
Sequence Databases. In D. Lomet, (ed.), Proceedings of the 4th International
Conference of Foundations of Data Organization and Algorithms (pp. 69–84),
Chicago: Springer Verlag.
Agrawal, R. and R. Srikant, 1995. Mining sequential patterns, in Proceedings
ICDE, (pp. 3–14).
Bessembinder, H. and K. Chan, 1998. Market efficiency and the returns to
technical analysis, Financial Management, 27, 5-17.
Brock, W., J. Lakonishok and B. LeBaron, 1992. Simple technical trading rules
and the stochastic properties of stock returns, Journal of Finance, 47(5), 17311764.
Bulkowski, T.N., 2000. Encyclopedia of chart patterns, John Wiley and Sons.
Caraca-Valente, J. P. and I. Lopez-Chavarrias, 2000. Discovering similar
patterns in time series. In proceedings of the 6th ACM SIGKDD Int'l Conference on
Knowledge Discovery and Data mining (pp 497-505).
Chan, L. and J. Lakonishok, 1996. Momentum strategies, Journal of Finance, 51,
1681-1713.
Das, G., K.I. Lin, H. Manilla, G. Renganathan and P. Smyth, 1998. Rule
discovery from time series, Proceedings of the Fourth International Conference on
Knowledge Discovery and Data Mining, (pp. 16-22).
Faloutsos, C., M. Ranganathan and Y. Manolopoulos, 1994. Fast subsequence
matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD
International Conference on Management of Data, (pp. 419–429).
Fang, Y. and D. Xu, 2003. The predictability of asset returns: an approach
combining technical analysis and time series forecasts, International Journal of
Forecasting, 9(3), 369-385.
Freitas, A.A., 2002. Data mining and knowledge discovery with evolutionary
algorithms, Berlin: Spinger-Verlag.
Hetland, M.L., 2004. A survey of recent methods for efficient retrieval of similar
time sequences. In Mark Last, Abraham Kandel, and Horst Bunke, (eds), Data
Mining in Time Series Databases. World Scientific.
Hetland, M.L. and P. Sætrom, 2002. Temporal Rule Discovery using Genetic
Programming and Specialized Hardware, in Proceedings Fourth International
Conference on Recent Advances in Soft Computing.
291
EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA
MINING
Keogh, E.J., K. Chakrabarti, M.J. Pazzani, and S. Mehrotra, 2007.
Dimensionality reduction for fast similarity search in large time series databases,
Knowledge and Information Systems, 3(3), 263-286.
Keogh, E.J., S. Chu, D. Hart and M.J. Pazzani, 2004. Segmenting time series: a
survey and novel approach, in Mark Last, Abraham Kandel, Horst Bunke, (eds.)
Data Mining in Time Series Databases, World Scientific Press.
Keogh, E.J., S. Lonardi and B. Chiu, 2002. Finding surprising patterns in a time
series database in linear time and space, Proceedings of the Fourth International
Conference on Knowledge Discovery and Data Mining, (pp. 550–556).
Keogh, E.J. and P. Smyth, 1997. A probabilistic approach to fast pattern matching
in time series databases. In Proceedings of the Third International Conference of
Knowledge Discovery and Data Mining, (pp. 24-30).
Kim, E., J.M. Lam and J. Han, 2000. AIM: approximate intelligent matching for
time series data. In Proceedings of Data Warehousing and Knowledge Discovery,
2nd Int'l Conference (pp 347-357).
Kim, S.W., S. Park and W.W. Chu, 2001. An index based approach for similarity
search supporting time warping in large sequence databases. In ICDE.
Last, M., Y. Klein and A. Kandel, 2001. Knowledge discovery in time series
databases, IEEE Transactions on Systems Man and Cybernetics - B, 31(1), 160-169.
Leigh, W., M. Paz and R. Purvis, 2002. An analysis of a hybrid neural network
and pattern recognition technique for predicting short-term increases in the NYSE
composite index, Omega, 30, 69-76.
Levich, R. and L. Thomas, 1993. The significance of technical trading-rule profits
in the foreign exchange market: A bootstrap approach, Journal of International
Money and Finance, 12, 451-474.
Lo, A.W., H. Mamaysky and J. Wang, 2000. Foundations of technical analysis:
Computational algorithms, statistical inference, and empirical implementation,
Journal of Finance, 55(4), 1705-1765.
Lu, H., J. Han and L. Feng, 1998. Stock movement prediction and N-dimensional
inter-transaction association rules, Proceedings of 1998 SIGMOD, (pp. 12:1-12:7).
Lin, J., E.J. Keogh, S. Lonardi and P. Patel, 2002. Finding Motifs in Time
Series. In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. (pp. 53-68).
Mannila, H., H. Toivonen and A.I Verkamo, 1997. Discovery of frequent
episodes in event sequences, Data Mining and Knowledge Discovery, 1(3), 259–
289.
Murphy, J., 1986. Technical analysis of future markets. New York: Prentice-Hall.
Park, S., W.W. Chu, J. Yoon and C. Hsu, 2000. Efficient searches for similar
subsequences of different lengths in sequence databases. In Proceedings of the 16th
Int'l Conference on Data Engineering, (pp 23-32).
Perng, C.S., H. Wang, S.R. Zhang and D.S. Parker, 2000. Landmarks: a new
model for similarity-based pattern querying in time series databases. In ICDE,
(pp.33–42).
Povinelli, R. and X. Feng, 2003. A new temporal pattern identification method for
characterisation and prediction of complex time series events, IEEE Transactions
on Knowledge and Data Engineering, 15(2), 339- 352.
292
STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS
Sarker, B.K., T. Mori, T. Hirata and K. Uehara, 2003. Parallel algorithms for
mining association rules in time series data. Lecture Notes in Computer Science,
2745, 273 – 284.
Taylor, M.P., 1995. The economic of exchange rates, Journal of the Economic
Literature, 33, 13-47.
Weiss, G.M. and H. Hirsh, 1998. Learning to predict rare events in event
sequences, in Fourth International Conference Knowledge Discovery and Data
Mining, (pp. 359–363).
Weiss, G.M. and N. Indurkhya, 1998. Predictive data mining: A practical guide.
San Fransisco: Morgan Kaufmann.
Welch, P.D., 1967. The use of fast fourier transform for the estimation of power
spectra: A method based on time averaging over short, modified periodograms,
IEEE Transactionson Audio Electroacoustics, 15, 70-73.
Yi, B.K. and C. Faloutsos, 2000. Fast Time Sequence Indexing for Arbitrary Lp
Norms. VLDB 2000, (pp. 385-394).
Zeng, Z., H. Yan and A.M.N. Fu, 2001. Time-series prediction based on pattern
classification, Artificial Intelligence in Engineering, 15(1), 61-69.
293