Download extracting formations from long financial time series using data mining

BRUSSELS ECONOMIC REVIEW – CAHIERS ECONOMIQUES DE BRUXELLES VOL. 53 N° 2 SUMMER 2010 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING STELLA KARAGIANNI1, THANASIS SFETSOS2 3 AND COSTAS SIRIOPOULOS ABSTRACT: Technical analysis has become a custom decision support tool for traders and analysts, though not widely accepted by the academic community. It is based on the identification of a series of well-defined formations appearing over irregular intervals. The same principle forms the basis for the application of data mining methodologies as a tool to discover hidden patterns that exist in a time series, which is achieved by a detailed breakdown of historic information. This paper introduces a methodology for the discovery of formations that exist within a time series and have high probability of reoccurrence. The methodology was developed in an efficient manner requiring only a small number of user-specified parameters. Its two main stages are (a) a modified bottom-up segmentation algorithm with an optimization stage to reach the optimal number of segments, and (b) a rule extraction algorithm. The developed methodology is tested on two major financial series, the daily closing values of the SP500 Index and the GB Pound to US Dollar exchange rates. JEL-CLASSIFICATION: C22, C63, F31. KEYWORDS: technical analysis, data mining, exchange rates, stock market, pattern recognition, rule extraction. 1 2 3 Department of Economics, University of Macedonia, 156, Egnatia st., 54006 Thessaloniki, Greece. EREL, INTRP, NCSR Demokritos, 15310 Aghia Paraskevi, Greece. Email: [email protected] Dept. of Business Administration, University of Patras, 26500, Patra, Greece. Email: [email protected] 273 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING INTRODUCTION Recent empirical studies in the international literature have documented substantial evidence on the predictability of asset returns and foreign exchange markets using technical analysis trading rules (Brock et al, 1992; Levich & Thomas, 1993; Chan et al, 1996; Bessembinder & Chan, 1998; Allen & Karjalainem, 1999; Lo et al, 2000; Fang & Xu, 2003). These studies report that asset returns are correlated, and hence, some degree of predictability can be captured, by technical trading rules, by time series models or their combination. In this work a data-driven methodology is developed to discover formations, of non-constant length, that appear in a univariate financial data set. The methodology was developed to require only a limited number of parameters from a potential future user. Technical analysis is defined as the attempt to identify regularities in the time series of price and volume information from a financial market, by extracting patterns from noisy data (Lo, 2000). Analysts use price charts to study market action as a means to predict future price trends. They believe that prices move in trends, and that history (price patterns) repeats itself (Murphy, 1986). Price patterns are pictures or formations, that can be classified into different categories, and have predictive value. Although there are potentially an infinite number of price formations, the technical analysis literature (Bulkowski, 2000) suggests that certain price formations are widely identified in stock and foreign exchange markets. A step beyond technical analysis is the application of data mining to identify previously unknown formations in a time series. This technique also provides the opportunity to generate association rules that linguistically describe the examined process. Weiss and Indurkhya (1998) define data mining as “the search for valuable information in large volumes of data. Predictive data mining is a search for very strong patterns in big data sets that can generalize to take accurate future decisions”. There are two main types of data mining applications in finance. The first is that some predefined structure of the series exists and an indexing approach aims to identify similar behaviour within the dataset. The alternative is to apply theses methodologies to discover previously unknown patterns within the series and examine reoccurrence characteristics for prediction purposes. In the literature there are several techniques developed to analyse financial data and extract information about their future behaviour. Zeng et al (2001) proposed a hybrid approach for the prediction of stock market trend using pattern recognition and classification. A set of vectors describing future patterns were introduced which were classified based with a probabilistic relaxation algorithm. The approach was tested on the Air-Quantas stock price and a 68% success was reported. Leigh (2002) introduced a method for template matching to predict future trends of a series. The nature of the produced results resembled those produced from technical analysis. On the NYSE composite index he reported similar success in the prediction of a 5day pattern. Das et al (1998) proposed an approach for the discovery of rules from sequential observations, using discrete representation. The rules included local relationships in the series using a finite number of primitive shapes, initially found from a greedy 274 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS clustering algorithm. Experimental results were presented for ten companies of the NASDAQ stock market for a period of 19 months. Lu et al. (1998) studied stock series for extracting inter-transaction association rules. They defined events on the time scale and tried to extract relations between similar patterns. The algorithm was limited to a fixed "sliding window" time interval. Only associations among the events within the same window were extracted. A demonstration of a single dimensional database of a stock market was given, but without prediction results. Last et al (2001) introduced a general methodology for knowledge discovery in time series databases. The process included cleaning of the data, identifying the most important predicting attributes, and extracting a set of association rules to predict the series future behaviour. The computational theory of perception was used to reduce the set of extracted rules by fuzzification and aggregation. Povinelli and Feng (2003) proposed a method that identified predictive temporal structures in reconstructed phase spaces. A genetic algorithm was employed to search such phase spaces for optimal heterogeneous clusters that are predictive of the desired events. Their approach was able to classify time series events more effectively compared to Time Delay Neural Networks and the C4.5 algorithm. This works introduces a two-stages methodology to identify formations of nonconstant length appearing within a financial time series. Initially, the series is represented in linear segments using a modified bottom-up approach. Furthermore, one of the most vague issues in similar studies, the determination of the optimal number of segments is addressed. Then, a rule extraction algorithm is applied to estimate those series formations with a higher probability of reoccurrence. The proposed approach is applied on two major financial series, the daily closing values of the SP500 Index and the GB Pound to US Dollar exchange rate. I. METHODOLOGY In the literature there are a number of authors discussing the issue of similarity for time series segments. The problem is defined in Hetland (2004) as: “For a sequence q, find the nearest ones from a set x. These are defined as those whose distance, using any selected measure, from the original is below a predefined tolerance threshold.” The problem gets further complicated when the original segments are not previously known and the examined segments are of non-constant length. The target of the developed methodology is to identify patterns or formations, of a time series that have high probability of reoccurrence, without any previous knowledge about their characteristics. It is designed to produce fast and easily interpretable results with only a small number of parameters that need to be selected. Furthermore, as will be discussed in the following paragraphs, all parts of the methodology present an improvement over previous research (Kim et al, 2000; Park et al, 2000; Caraca-Valente & Lopez-Chavarrias, 2000; Last et al, 2001; Lin et al, 2002; Povinelli & Feng, 2003; Sarker et al, 2003). In this work the developed methodology is tested to financial data, although its implementation to type of data is straightforward. The various stages of the process are:  Data Cleaning. A suitably selected low pass filter is used to reduce the amount of noise present in the series. 275 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING  Piecewise Linear Segmentation. The bottom up segmentation algorithm is applied with the modification of joining nearest segments if they have similar trends. An optimization phase is included to identify the optimum number of segments.  Symbolic Representation. The resulting segments of the series are then classified into a finite alphabet.  Rule Extraction. Finally, the sequential characteristics of the alphabet are extracted in the form of symbolic rules. A. DATA CLEANING Initially, the original time series was smoothed using a low-pass digital FIR filter, (1). The signal contains lower amount of noise which brings to light characteristics such as trend. The coefficients, ακ and bκ, are estimated so that the resulting signal has the desired behaviour in terms of frequency response. The low-pass filter cutoff frequency was determined using Welch’s (1967) power spectral density algorithm. N M k 0 k 1 yt   ak xt k   bk yt k (1) B. LINEAR SEGMENTATION WITH MATCHING TRENDS The epicentre of all subsequent matching approaches is the efficient representation of the time series, which substantially speeds up the rule extraction algorithm. Argawal (1993) gathered information about the characteristics of a series using the F-index methodology, later evolved by Faloutsos (1994) into the ST-index to account for subsequence matching. Yi and Faloutsos (2000) and Keogh et al (2001) derived independently a piecewise constant approximation approach to divide each sequence into equally spaced segments, using the average value of each as a signature. Keogh and Smyth (1997) applied a probabilistic landmark-based technique, where the extracted features are characteristic parts of the sequence. Perng et al (2000) applied a similar approach that identifies points of “great importance” as landmark features. The specific form used defined landmarks as points where the approximation function’s nth derivative is zero. Kim et al (2001) used as landmarks the boundaries of the segments as well as minimum and maximum points. Many researchers attempt to capture the main characteristics of the data using piecewise linear approximation. There are three main algorithms applied for this task: sliding windows, top-down and bottom-up, which are combined with two alternatives approximations: interpolation and regression. For an excellent review and performance comparison of these approaches the reader is directed to the work of Keogh at al (2004). The authors also proposed the SWAB algorithm, a combination of sliding windows and bottom-up, which was empirically shown to be superior to any other examined. 276 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS The proposed algorithm is a modified bottom up algorithm with a linear interpolation between end-points, also including an extra phase that forces the merging of nearest segments if their trends are similar, (Figure 1). Two trends are presumed similar if their difference is below a predefined threshold. Furthermore, we address one of the most vague issues in similar studies, the selection of an optimum number of segments. Thus, the only parameters that need to be selected are the minimum number of data contained in each segment and the trend similarity threshold. This algorithm requires the use of a cost function. It assists in the process of selecting the appropriate segments to merge, and is the critical measure for the selection of the optimal number of segments. The selected cost function in this work is the squared sum of the Euclidean distance of all members of a segment from the linear interpolant between the end points, i.  k  In the case of merging segments e    dist ( y j  I k )     j 1  In case of total series cost E     2  dist ( y j  I k )   j 1  k (2a) 2   all segments (2b) FIGURE 1. BOTTOM-UP ALGORITHM WITH MATCHING TRENDS Define: minimum segment length, dmin and trend matching threshold εt. Step 1 Split series into segments with length dmin, Assign any remain data to the first or last segment Merge nearest segments if their trend is similar Store number of total segments n_seg(1) , and total cost of series E(1) Step 2 Loop all segments Estimate the merging cost of two nearby segments, e Find segments with minimum e and merge Step 3 Estimate trend of new segment and compare with neighboring trends If trends similar then merge Store total cost of series E(k) and number segments n_seg(k) Step 4 Go to Step 2, unless a predefined number of segments are reached Step 5 Selection of the optimal number of segments 277 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING The looping stage of the algorithm stops when the total number of segments reaches below 10% of the number segments estimated in Step 1 (n_seg(1)). This was set to speed up the algorithm, since it was empirically found that financial series require a relatively high number of segments due to their complex characteristics. The optimal number of segments is found from a trade-off the number of segments and the total cost of the series, using their combinatory expression. These two indices are scaled to vary between [0,1], which ensures the validity of the scheme to any sort of data. The finally selected number is the one that minimizes (3). max(n _ seg )  n _ seg max(n _ seg )  min(n _ seg ) max( E )  E E_n  max( E )  min( E ) Comb  n  E _ n n (3) C. SYMBOLIC REPRESENTATION The finally determined segments, s, are then classified into a finite number of distinct classes or alphabet, Cs, which facilitates the process of rule extraction. There are two main procedures for achieving this, heuristically and clustering (Das 1998). The size of alphabet should be selected taking into consideration the total number of series segments. In this work, the alphabet size was directly related to the financial nature of the examined series. Heuristically, five classes were chosen to represent different market states. Additional information included was the duration of the segments, classified into three types (Table 1). TABLE 1. ALPHABET SELECTION Series Values Segment Duration Class Symbol Market Class Symbol Very Negative VN Bearish Short, DR ≤ 5 days S Negative N Decrease Biweekly, 5 < DR ≤ 10 days B Constant C Stable Long, DR > 10 days L Positive P Increase Very Positive VP Bullish D. RULE EXTRACTION A rule extraction algorithm is derived to estimate association rules that describe sequential characteristics of the estimated classes, as selected in the previous step. The association rules are of the generalised form. 278 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS C(s-j) … C(s-2) C(s-1)  C(s) … C(s+k) (5) The antecedent part or Left Hand Side (LHS) contains historic information about the classes, C, ordered in a sequential manner. The consequent part or Right Hand Side (RHS) contains future information of the series. Two measures are used to determine the strength of the rule, using information contained in the data set:  Support is the number of instances that the rule appears within the data set.  Confidence is the accuracy with which the rule predicts correctly, i.e. that the LHS leads to the RHS. These measures, which are user-defined parameters, are a measure of the frequency that a formation appears in the data set and its predictability. The total number of possible rules is the number of different classes to the power of the combined LHS and RHS dimensions. FIGURE 2. RULE EXTRACTION ALGORITHM Define: minimum support and confidence levels Step 1 Select LHS as C(s-1) Step 2 Set flag (for all segments) = true Loop all segments (s) Find first true flag and add as a new rule Find remaining data following specified rule Set flag(data) = false Step 3 Remove rules < support threshold Remove rules < confidence threshold Save final set of rules Step 4 If no rules are found End algorithm Else Add previous class to LHS Go to Step 2 Previous attempts of solving the problem of mining predictive rules from time series can be classified into two main types. Supervised methods, where the targetrule form is known in advance and used as input to the rule extraction algorithm. Thus the objective of the analysis becomes to generate rules for predicting these events based on the data available before the event occurred (Weiss, 1998, Hetland, 2002). In the second type, unsupervised methods, the inputs to the algorithms are only the data. The goal is to automatically extract informative rules from the series. In most cases this means that the rules should have some level of preciseness, be representative of the data, easy to interpret, and interesting to the financial analyst (Freitas, 2002). The most celebrated approach in the literature (e.g. Agrawal, 1995; Mannila,,1997; Keogh, 2002) is to sequentially scan all the available data and add up every occurrence of a rule, and of every antecedent and consequent part. This counting makes it possible to calculate the frequency of appearance and confidence 279 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING of each rule, in the cost of limiting the rule format. The developed rule extraction algorithm is based on a sequential database scan, but the search in each loop is done with decreasing number of samples. The entire process is iterated with an increasing adjacent part until the support of all estimated rules are below a predefined threshold value. The finally presented rules are those above a specified support and confidence threshold value. These values are selected to extract only frequent patterns with high probability that the consequent part will appear in the future, enabling profitable trading decisions. FIGURE 3. (A) DETAIL OF SEGMENTED SP500 SERIES AND (B) THE OPTIMIZATION PHASE 1100 Index Value 1000 900 800 700 600 4500 4550 4600 4650 Days 1 Scaled Values 0.8 Combined 0.6 Number of Segments 0.4 0.2 0 0 280 Total Cost 500 1000 1500 2000 Iterations STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS 2. APPLICATION TO THE SP500 DAILY INDEX The first series that the methodology was applied to is the daily closing values of the SP500 index. The data covered a period of 19 years including only working days, from the first trading day of 1985 to January 27, 2004. Initially, a low-pass filter was applied with a cut-off frequency value of 0.1 (1/day), which is approximately a period of two working days. The minimum segment length, dmin, was set to three, including end-points, and the trend-matching threshold, εt, to 0.3%. A detail of the resulting segmented series is presented in Figure 3a, alongside the graph showing the scaled values of the total cost, E, and the number of segments (3b) The derived segments were characterised by their trend defined as the percentage x  x start change between the start and end points as: tr  100 * end . The trends of x start each segment were classified into five different classes, Table 2, as TABLE 2. TREND CLASSIFICATION FOR THE SP500 SERIES VN ≤ -4% -4% < N ≤ -1% -1% < C ≤1% 1% < P ≤ 4% VP > 4% These values were selected so that the classes contain roughly the same number of data. The threshold values for the rule extraction algorithm were set as 5 values for support and 70% for confidence. The later ensured that the extracted rules would lead to profitable decisions. Two different types of rules were extracted. Initially, the antecedent part contained only the trend type class, whilst in the second it additionally contained the duration class of the segment. The results from both cases are summarized in Tables 3 and 4. TABLE 3. SP500 SERIES RULES USING TREND ID R1 R2 R3 R4 Rule TR(s-4): VP & TR(s-3): N & TR(s-2): P & TR(s-1): C  TR(s): P TR(s-4): C & TR(s-3): VN & TR(s-2): P & TR(s-1): N  TR(s): P TR(s-5): N & TR(s-4): P & TR(s-3): N & TR(s-2): P & TR(s-1): P  TR(s): C TR(s-5): P & TR(s-4): N & TR(s-3): P & TR(s-2): P & TR(s-1): C  TR(s): N Support 6 Confidence 100% 5 71.4% 5 83.3% 7 77.7% The first rule (R1) is translated, as segments with strong increase followed by relatively milder decrease and increase respectively and then a period of stable market behaviour will be followed by increasing values. This rule appeared 281 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING collectively six times in the examined series, all in periods with upward trend, and all of them resulted in the same consequent characteristics. Rule 2 shows a stable market followed by strong decrease is and then by a succession of positive and negative behaviour. This rule mostly appears as the first half of a peak starting cyclical behaviour. Rule 3 shows a succession of segments with negative and positive trends ending in stable market behaviour when two positive periods, with different trend, appear consecutively. Rule 4 is partly the continuation of the previous rule exhibiting a decrease in the future value of the trend, but also showed up independently in more than half of appearances. FIGURE 4. EXAMPLES OF R1-R4 IN THE SP500 SERIES 860 840 Index Value 820 800 780 760 740 3040 3060 3080 Days 3100 3120 3140 1160 1140 1120 Index Value 1100 1080 1060 1040 1020 1000 980 4300 4350 4400 Days 282 4450 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS 600 Index Value 590 580 570 560 550 540 2670 2680 2690 Days 2700 1240 Days 1245 2710 2720 350 Index Value 345 340 335 330 1220 1225 1230 1235 1250 1255 1260 TABLE 4. SP500 SERIES RULES USING TREND AND DURATION ID R5 R6 R7 R8 R9 Rule seg(s-2): VP_B & seg(s-1): N_S  seg(s): P seg (s-2): N_B & seg (s-1): P_B  seg (s): N seg (s-3): N_B & seg (s-2): P_B & seg (s-1): N_S  seg (s): P seg (s-3): VN_S & seg (s-2): C_S & seg (s-1): P_S  seg (s): N seg (s-3): C_S & seg (s-3): N_S & seg (s-2): C_S & seg (s-1): P_S  seg (s): N Support 10 6 5 Confidence 71.4% 75% 71.4% 5 71.4% 5 71.4% In the next stage, the analysis included the duration of the antecedent part as additional information, in attempt to gather more information the behaviour and transitional characteristics of the series. Rule five (R5) shows that biweekly periods with strong increase that are followed by short periods of decrease are more likely to results in increasing future behaviour. This rules is associated mainly with bullish market periods. The next rule, R6, is a succession of negative and positive periods lasting biweekly periods, which leads to a decrease. Rule 7 represents periods where biweekly negative and positive periods followed by short decrease will mostly be followed by increasing behaviour. The last two rules appear on the first 12 years the examined series. Rule 8, R8, shows a succession of bearish, stable and positive market characteristics, each lasting less than a week, is followed by negative trend. On the majority of the appearances of this rule the antecedent part forms a complete cycle, and the consequent the start of a second one. 283 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING A short decrease is followed by biweekly stable periods are more likely to be followed by an increase in the series price. The antecedent part of R9 is a full cycle last more than three weeks in total, which is followed by the start of a new cycle. FIGURE 6. EXAMPLES FROM R5, R6, R7 AND R8 1130 1120 1110 Index Value 1100 1090 1080 1070 1060 1050 1040 1030 3330 3335 3340 Days 3345 3350 465 Index Value 460 455 450 445 2180 284 2185 2190 2195 Days 2200 2205 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS 422 420 Index Value 418 416 414 412 410 408 406 1845 1850 1855 1860 Days 1865 1870 1875 1340 Index Value 1320 1300 1280 1260 1240 3732 3734 3736 3738 3740 3742 Days 3744 3746 3748 3. APPLICATION TO THE DAILY POUND/DOLLAR EXCHANGE RATE The developed methodology is applied to the daily GB pound to US dollar exchange rate series. The data covered a period of approximately 32 years from the 4th January 1971 to December 23, 2003. The exchange rates series is influenced by a number of factors, due to its significance as a policy tool for dealing with macroeconomic issues. Furthermore, there are many ways that central banks intervene in the foreign exchange, which induces additional features in the series (Taylor). Initially, a low-pass filter was applied with a cut-off frequency value of 0.07 (1/day), which is approximately a period of two weeks. The minimum segment length, dmin, was set to five and the trend-matching threshold, εt, to 0.25%. The optimization stage returns 763 segments (Figure 7). 285 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING FIGURE 7. FOREIGN EXCHANGE SERIES AND ITS SEGMENTATION 2.8 2.6 2.4 2.2 2 1.8 1.6 1.4 1.2 1 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 The derived segments were characterised by their trend again defined in percentage terms. A five letters alphabet was chosen, Table 5. For the trend classification, the three classes described in Table 1 were selected. TABLE 5. TREND CLASSIFICATION FOR THE GBP/USD SERIES VN ≤ -2% -2% < N ≤ -0.5% -0.5% < C ≤0.5% 0.5% < P ≤ 2% VP > 2% The threshold values for the rule extraction algorithm were set as 3 values for the support and 70% for the confidence parameters. Again the same two types of rules were extracted, with the antecedent part contained trend type class and the duration of the segments. Results from both cases are summarized in Tables 6 and 7. TABLE 6. GBP/USD RULES USING SEGMENT TREND ID R1 R2 R3 R4 R5 286 Rule TR(s-4): VN & TR(s-3): N & TR(s-2): P & TR(s-1): VN  TR(s): VN TR(s-4): P & TR(s-3): N & TR(s-2): VN & TR(s-1): P  TR(s): VN TR(s-4): VN & TR(s-3): P & TR(s-2): VN & TR(s-1): P  TR(s): VN TR(s-4): C & TR(s-3): VP & TR(s-2): C & TR(s-1): N  TR(s): P TR(s-5): P & TR(s-4): N & TR(s-3): P & TR(s-2): N & TR(s-1): VN  TR(s): C Support 3 Confidence 100% 3 75% 3 75% 3 75% 3 75% STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS Rules 1 to 3 represent a periods of decreasing trend, ending in very negative market characteristics (Figure 8). Rules 1 appears at periods with strong increasing trends, whilst R2 the latter mainly appears as the falling part of cycles lasting approximately three months. R3 appears at non well-defined periods of the data. Rule 4 shows a general upward trend with stable and mildly decreasing characteristics. Rule 5 shows a succession of positive and negative periods ending in low fluctuating markets. FIGURE 8. EXAMPLES FROM R1, R2 AND R3 2.38 2.36 2.34 2.32 2.3 2.28 2.26 2.24 2.22 2.2 2.18 750 760 770 780 790 800 1.96 1.94 1.92 1.9 1.88 1.86 1.84 1.82 1840 1850 1860 1870 1880 1890 1900 1910 1.58 1.56 1.54 1.52 1.5 1.48 1.46 3140 3150 3160 3170 3180 3190 287 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING In the next stage, the analysis included the duration of the antecedent part as additional information, in attempt to gather more information the behaviour and transitional characteristics of the series (Table 7). A set of three rules (R6-R8) shows periods ending in strong increasing characteristics (Figure 9). R6 shows the formation of a cycle that starts with long periods of negative and stable characteristics, followed by sharp increase. R7 captures the duration of a whole cycle whose antecedent part is approximately three weeks. Rule 8 shows strong increasing behaviour interrupted by a very short decreasing period. Another set of rules with bigger LHS is associated with decreasing consequent behaviour (Figure 10). Rule 9 represents the behaviour of the series period after a cycle peak. Rule 10 is a succession of decreasing and increasing periods followed by decreasing behaviour. This rule appears at irregular instances in the period. Finally, R11 is a succession of biweekly trends that appears mostly on periods of low fluctuations. TABLE 7. GBP/USD EXTRACTED RULES USING TREND AND DURATION INFORMATION ID R6 R7 R8 R9 R10 R11 288 Rule seg (t-2): N_L & seg (t-1): C_L  seg (t): VP seg (t-2): VN_S & seg (t-1): C_B  seg (t): VP seg (t-2): VP_L & seg (t-1): N_S  seg (t): VP seg (t-3): P_L & seg (t-2): VP_B & seg (t-1): C_B  seg (t): N seg (t-3): N_B & seg (t-2): P_B & seg (t-1): N_L  seg (t): VN seg (t-3): P_B & seg (t-2): N_B & seg (t-1): P_B  seg (t): N Support 3 3 3 3 5 3 Confidence 75% 100% 75% 75% 71.4% 100% STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS FIGURE 9. EXAMPLES OF R6, R7 AND R8 1.65 1.64 1.63 1.62 1.61 1.6 1.59 1.58 1.57 1.56 1.55 6700 6705 6710 6715 6720 6725 6730 6735 6740 1.67 1.66 1.65 1.64 1.63 1.62 1.61 1.6 1.59 3094 3096 3098 3100 3102 3104 3106 3108 3110 3112 1.54 1.52 1.5 1.48 1.46 1.44 4165 4170 4175 4180 4185 4190 4195 4200 289 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING FIGURE 10. EXAMPLES FROM R9, R10, R11 2.56 2.54 2.52 2.5 2.48 2.46 570 580 590 600 610 620 1.9 1.88 1.86 1.84 1.82 1.8 4510 4515 4520 4525 4530 4535 4540 4545 4550 4555 1.64 1.63 1.62 1.61 1.6 1.59 1.58 1.57 1.56 6315 6320 6325 6330 6335 6340 6345 6350 4. CONCLUSIONS In this paper we developed a methodology for the detection and extraction of similar subsequences of the data that appear in a time series. The methodology is designed to produce easily interpretable results with only a small number of parameters that needed to be defined. The main stages of the algorithm are the series representation using a bottom up segmentation algorithm extended to account for matching joining segments with similar trends, and a modified sequential rule extraction algorithm. Furthermore, it addresses one of the most overlooked issues in similar studies, that is the optimal selection of the number of segments. 290 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS The approach is tested on two financial series the daily closing values of the SP500 index and the GB Pound to US Dollar exchange rate. Adding to trend type representation of the series, the duration of the trend was included. Setting a confidence threshold parameter as 70%, three and five rules were identified for the two series respectively using only trend type information. The number of rules increased to five and six respectively when duration was used. REFERENCES Alexander, S.T., 1986. Adaptive signal processing: Theory and applications, Berlin: Springer Verlag. Allen, F. and R. Karjalainen, 1999. Using generic algorithms to find technical trading rules, Journal of Financial Economics, 51, 245-271. Agrawal, R., C. Faloutsos and A.N. Swami, 1993. Efficient Similarity Search In Sequence Databases. In D. Lomet, (ed.), Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms (pp. 69–84), Chicago: Springer Verlag. Agrawal, R. and R. Srikant, 1995. Mining sequential patterns, in Proceedings ICDE, (pp. 3–14). Bessembinder, H. and K. Chan, 1998. Market efficiency and the returns to technical analysis, Financial Management, 27, 5-17. Brock, W., J. Lakonishok and B. LeBaron, 1992. Simple technical trading rules and the stochastic properties of stock returns, Journal of Finance, 47(5), 17311764. Bulkowski, T.N., 2000. Encyclopedia of chart patterns, John Wiley and Sons. Caraca-Valente, J. P. and I. Lopez-Chavarrias, 2000. Discovering similar patterns in time series. In proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data mining (pp 497-505). Chan, L. and J. Lakonishok, 1996. Momentum strategies, Journal of Finance, 51, 1681-1713. Das, G., K.I. Lin, H. Manilla, G. Renganathan and P. Smyth, 1998. Rule discovery from time series, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, (pp. 16-22). Faloutsos, C., M. Ranganathan and Y. Manolopoulos, 1994. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, (pp. 419–429). Fang, Y. and D. Xu, 2003. The predictability of asset returns: an approach combining technical analysis and time series forecasts, International Journal of Forecasting, 9(3), 369-385. Freitas, A.A., 2002. Data mining and knowledge discovery with evolutionary algorithms, Berlin: Spinger-Verlag. Hetland, M.L., 2004. A survey of recent methods for efficient retrieval of similar time sequences. In Mark Last, Abraham Kandel, and Horst Bunke, (eds), Data Mining in Time Series Databases. World Scientific. Hetland, M.L. and P. Sætrom, 2002. Temporal Rule Discovery using Genetic Programming and Specialized Hardware, in Proceedings Fourth International Conference on Recent Advances in Soft Computing. 291 EXTRACTING FORMATIONS FROM LONG FINANCIAL TIME SERIES USING DATA MINING Keogh, E.J., K. Chakrabarti, M.J. Pazzani, and S. Mehrotra, 2007. Dimensionality reduction for fast similarity search in large time series databases, Knowledge and Information Systems, 3(3), 263-286. Keogh, E.J., S. Chu, D. Hart and M.J. Pazzani, 2004. Segmenting time series: a survey and novel approach, in Mark Last, Abraham Kandel, Horst Bunke, (eds.) Data Mining in Time Series Databases, World Scientific Press. Keogh, E.J., S. Lonardi and B. Chiu, 2002. Finding surprising patterns in a time series database in linear time and space, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, (pp. 550–556). Keogh, E.J. and P. Smyth, 1997. A probabilistic approach to fast pattern matching in time series databases. In Proceedings of the Third International Conference of Knowledge Discovery and Data Mining, (pp. 24-30). Kim, E., J.M. Lam and J. Han, 2000. AIM: approximate intelligent matching for time series data. In Proceedings of Data Warehousing and Knowledge Discovery, 2nd Int'l Conference (pp 347-357). Kim, S.W., S. Park and W.W. Chu, 2001. An index based approach for similarity search supporting time warping in large sequence databases. In ICDE. Last, M., Y. Klein and A. Kandel, 2001. Knowledge discovery in time series databases, IEEE Transactions on Systems Man and Cybernetics - B, 31(1), 160-169. Leigh, W., M. Paz and R. Purvis, 2002. An analysis of a hybrid neural network and pattern recognition technique for predicting short-term increases in the NYSE composite index, Omega, 30, 69-76. Levich, R. and L. Thomas, 1993. The significance of technical trading-rule profits in the foreign exchange market: A bootstrap approach, Journal of International Money and Finance, 12, 451-474. Lo, A.W., H. Mamaysky and J. Wang, 2000. Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation, Journal of Finance, 55(4), 1705-1765. Lu, H., J. Han and L. Feng, 1998. Stock movement prediction and N-dimensional inter-transaction association rules, Proceedings of 1998 SIGMOD, (pp. 12:1-12:7). Lin, J., E.J. Keogh, S. Lonardi and P. Patel, 2002. Finding Motifs in Time Series. In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (pp. 53-68). Mannila, H., H. Toivonen and A.I Verkamo, 1997. Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery, 1(3), 259– 289. Murphy, J., 1986. Technical analysis of future markets. New York: Prentice-Hall. Park, S., W.W. Chu, J. Yoon and C. Hsu, 2000. Efficient searches for similar subsequences of different lengths in sequence databases. In Proceedings of the 16th Int'l Conference on Data Engineering, (pp 23-32). Perng, C.S., H. Wang, S.R. Zhang and D.S. Parker, 2000. Landmarks: a new model for similarity-based pattern querying in time series databases. In ICDE, (pp.33–42). Povinelli, R. and X. Feng, 2003. A new temporal pattern identification method for characterisation and prediction of complex time series events, IEEE Transactions on Knowledge and Data Engineering, 15(2), 339- 352. 292 STELLA KARAGIANNI, THANASIS SFETSOS AND COSTAS SIRIOPOULOS Sarker, B.K., T. Mori, T. Hirata and K. Uehara, 2003. Parallel algorithms for mining association rules in time series data. Lecture Notes in Computer Science, 2745, 273 – 284. Taylor, M.P., 1995. The economic of exchange rates, Journal of the Economic Literature, 33, 13-47. Weiss, G.M. and H. Hirsh, 1998. Learning to predict rare events in event sequences, in Fourth International Conference Knowledge Discovery and Data Mining, (pp. 359–363). Weiss, G.M. and N. Indurkhya, 1998. Predictive data mining: A practical guide. San Fransisco: Morgan Kaufmann. Welch, P.D., 1967. The use of fast fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms, IEEE Transactionson Audio Electroacoustics, 15, 70-73. Yi, B.K. and C. Faloutsos, 2000. Fast Time Sequence Indexing for Arbitrary Lp Norms. VLDB 2000, (pp. 385-394). Zeng, Z., H. Yan and A.M.N. Fu, 2001. Time-series prediction based on pattern classification, Artificial Intelligence in Engineering, 15(1), 61-69. 293

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download extracting formations from long financial time series using data mining