Mining Event or State Sequences: A Social Science Perspective
Gilbert Ritschard
Department of Econometrics, University of Geneva
http://mephisto.unige.ch
IIS 2008, Zakopane, Poland, June 16-18 (13/7/2008gr)

My talk is about life courses
Example of a scientific life course, to help you understand what a social scientist does at IIS.
date: 1970-1979, 1980-1992, 1985-..., 1990-1995, 2000-..., 2003-..., 2005-...
event:
• Studies in econometrics
• Mathematical economics
• Work with social scientists (family studies)
• Interest in statistics for social sciences
• Interest in neural networks
• KDD and data mining (clustering, supervised learning)
• Work with historians, demographers, psychologists (longitudinal data)
• KDD and data mining approaches for analysing life course data

Outline
1 Sequence Analysis in Social Sciences
2 Survival Trees
3 Visualizing and clustering sequence data
4 Mining Frequent Episodes

1 Sequence Analysis in Social Sciences

Motivation
• Individual life course paradigm.
  - Following macro quantities (e.g. #divorces, fertility rate, mean education level, ...) over time is insufficient for understanding social behavior.
  - Need to follow individual life courses.
• Data availability
  - Large panel surveys in many countries (SHP, CHER, SILC, GGP, ...).
  - Biographical retrospective surveys (FFS, ...).
  - Statistical matching of censuses, population registers and other administrative data.

Motivation
• Need for suitable methods for discovering interesting knowledge from these individual longitudinal data.
• Social scientists use
  - essentially survival analysis (event history analysis);
  - more rarely sequential data analysis (optimal matching, Markov chain models).
• Could social scientists benefit from data-mining approaches?
• Which methods?
• Are there specific issues with those methods for social scientists?

Motivation: KD in social sciences
• In KDD and data mining, the focus is on prediction and classification: the goal is to reduce prediction and classification errors.
• In social science, the aim is understanding/explaining (social) behaviors. Hence the focus is on the process rather than the output.

What kind of data
What kind of data are we dealing with?
• Mainly categorical longitudinal data describing life courses.
• An ontology of longitudinal data (Aristotelean tree).

Alternative views of individual longitudinal data

Table: Time-stamped events, record for Sandra
event                     year
ending secondary school   1970
first job                 1971
marriage                  1973

Table: State sequence view, Sandra
year              1969      1970       1971       1972       1973
civil status      single    single     single     single     married
education level   primary   secondary  secondary  secondary  secondary
job               no        no         first      first      first

Issues with life course data
• Incomplete sequences
  - Censored and truncated data: cases falling out of observation before experiencing an event of interest.
  - Sequences of varying length.
• Time-varying predictors. Example: when analysing time to divorce, presence of children is a time-varying predictor.
• Data collected by clusters. Example: household panel surveys. Multi-level analysis is needed to account for unobserved shared characteristics of members of the same cluster.
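The two views of Sandra's record shown in the tables above can be linked by a small conversion routine. The following is a minimal sketch, not the procedure used in the talk: the function name `events_to_states`, the `(year, variable, new_value)` event layout and the initial values are assumptions made for illustration.

```python
def events_to_states(events, first_year, last_year, initial):
    """Turn time-stamped events into one state per year.

    events:  list of (year, variable, new_value) tuples (hypothetical layout).
    initial: dict with the value of each variable before any event occurs.
    Returns a dict mapping each year to a dict of variable values.
    """
    # Group the events by the year in which they occur.
    by_year = {}
    for year, variable, value in events:
        by_year.setdefault(year, []).append((variable, value))

    current = dict(initial)
    sequence = {}
    for year in range(first_year, last_year + 1):
        # An event changes the state from its own year onwards.
        for variable, value in by_year.get(year, []):
            current[variable] = value
        sequence[year] = dict(current)
    return sequence


# Sandra's record, as in the time-stamped table.
sandra_events = [
    (1970, "education level", "secondary"),  # ending secondary school
    (1971, "job", "first"),                  # first job
    (1973, "civil status", "married"),       # marriage
]
sandra_initial = {"civil status": "single", "education level": "primary", "job": "no"}
sandra_states = events_to_states(sandra_events, 1969, 1973, sandra_initial)

for year, state in sandra_states.items():
    print(year, state)
```

Each event is assumed to take effect in the year it is recorded, which is the convention the state sequence table follows for Sandra.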

Multi-level: Simple linear regression example
[Figure: number of children (y-axis, 0 to 9) against education (x-axis, 1 to 15), with fitted lines y = 15.6 - 0.8x, y = 12.5 - 0.8x, y = 6.2 - 0.8x and y = 3.2 + 0.2x.]

Methods for Longitudinal Data

Classical statistical approaches: Survival approaches
Survival or event history analysis (Blossfeld and Rohwer, 2002)
• Focuses on one event.
• Concerned with the duration until the event occurs or with the hazard of experiencing the event.
• Survival curves: distribution of the duration until the event occurs,
  $S(t) = p(T \ge t)$.
• Hazard models: regression-like models for $S(t, x)$ or for the hazard $h(t) = p(T = t \mid T \ge t)$,
  $h(t, x) = g\big(t, \beta_0 + \beta_1 x_1 + \beta_2 x_2(t) + \cdots\big)$.

Survival curves (Switzerland, SHP 2002 biographical survey)
[Figure: survival curves for women; survival probability (0 to 1) against age (0 to 80 years) for the events: leaving home, marriage, 1st childbirth, parents' death, last child left, divorce, widowing.]

Analysis of sequences
• Frequencies of given subsequences
  - Essentially event sequences.
  - Subsequences considered as categories ⇒ methods for categorical data apply (frequencies, cross tables, log-linear models, logistic regression, ...).
• Markov chain models
  - State sequences.
  - Focus on transition rates between states.
  - Does the rate also depend on previous states? How many previous states are significant?
• Optimal Matching (Abbott and Forrest, 1986)
  - State sequences.
  - Edit distance (Levenshtein, 1966; Needleman and Wunsch, 1970) between pairs of sequences.
  - Clustering of sequences.
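The survival function S(t) = p(T ≥ t) and the discrete hazard h(t) = p(T = t | T ≥ t) used in the survival approaches can be estimated from censored durations with a simple product-limit computation. The sketch below is only an illustration: the function name, the toy durations and the convention that a case censored at time t still counts as at risk at t are assumptions, not details given in the slides.

```python
def discrete_hazard_and_survival(durations, observed):
    """Estimate h(t) = p(T = t | T >= t) and S(t) = p(T >= t).

    durations: time at which each case experienced the event or was censored.
    observed:  True if the event occurred, False if the case was censored.
    Assumption: a case censored at time t is still at risk at time t.
    """
    hazard, survival = {}, {}
    surviving = 1.0  # probability of not having experienced the event before t
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival[t] = surviving          # S(t) = p(T >= t)
        hazard[t] = events / at_risk     # h(t) = p(T = t | T >= t)
        surviving *= 1.0 - hazard[t]     # S(t+) = S(t) * (1 - h(t))
    return hazard, survival


# Toy data: five marriages, durations in years, two of them censored.
toy_durations = [2, 3, 3, 5, 5]
toy_observed = [True, True, False, True, False]
toy_hazard, toy_survival = discrete_hazard_and_survival(toy_durations, toy_observed)

for t in sorted(toy_hazard):
    print(t, round(toy_hazard[t], 3), round(toy_survival[t], 3))
```

Censored cases contribute to the risk set up to their censoring time but never to the event count, which is how incomplete sequences are used without being discarded.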

Typology of methods for life course data

Questions     Issues: duration/hazard                       Issues: state/event sequencing
descriptive   • Survival curves: parametric (Weibull,       • Optimal matching clustering
                Gompertz, ...) and non-parametric           • Frequencies of given patterns
                (Kaplan-Meier, Nelson-Aalen) estimators     • Discovering typical episodes
causality     • Hazard regression models (Cox, ...)         • Markov models
              • Survival trees                              • Mobility trees
                                                            • Association rules among episodes

2 Survival Trees

The biographical SHP dataset
SHP biographical retrospective survey, http://www.swisspanel.ch
• SHP retrospective survey: 2001 (860 cases) and 2002 (4700 cases).
• We consider only data collected in 2002.
• Data completed with variables from the 2002 wave (language).

Characteristics of retained data for divorce (individuals who got married at least once):
                           men     women   Total
Total                      1414    1656    3070
1st marriage dissolution   231     308     539
                           16.3%   18.6%   17.6%

Distribution by birth cohort
[Figure: histogram of birth year (1910 to 1960), frequency 0 to 500.]

Marriage duration until divorce: Survival curves
[Figure: two panels, women and men; survival probability (0.5 to 1) against marriage duration (0 to 40), by birth cohort: 1942 and before, 1943-1952, 1953 and after.]

Marriage duration until divorce: Hazard model
Discrete time model (logistic regression on person-year data). exp(B) gives the odds ratio, i.e.
the change in the odds h/(1 − h) when the covariate increases by 1 unit.

                      exp(B)         Sig.
birthyr               1.0088         0.002
university            1.22           0.043
child                 0.73           0.000
language   unknwn     1.47           0.000
           French     1.26           0.007
           German     1              ref
           Italian    0.89           0.537
Constant              0.0000000004   0.000

Survival trees: Principle
• Target is the survival curve or some other survival characteristic.
• Aim: partition the data set into groups that
  - differ as much as possible (max between-class variability).
    Example: Segal (1988) maximizes the difference in KM survival curves by selecting the split with the smallest p-value of the Tarone-Ware chi-square statistic
    $TW = \dfrac{\sum_i w_i \big(d_{i1} - E(D_i)\big)}{\big(\sum_i w_i^2 \operatorname{var}(D_i)\big)^{1/2}}$
  - are as homogeneous as possible (min within-class variability).
    Example: Leblanc and Crowley (1992) maximize the gain in deviance (-log-likelihood) of relative risk estimates.

Divorce, Switzerland, Differences in KM Survival Curves I
[Figure: survival tree grown on differences in KM survival curves. Root: n = 3619, e = 622, S < 90% at 11, S(30) = 77%. First split on birth cohort (≤ 1940: n = 841, e = 123; > 1940: n = 2778, e = 499), TW(1) = 54.8, p < .0001. Further splits on language, child and university lead to seven leaves, L1 to L7.]

Divorce, Switzerland, Differences in KM Survival Curves II
[Figure: KM survival curves (y-axis 0.5 to 1.0, x-axis 0 to 40) for the seven leaves of the tree:
• Cohort <= 1940 & Non French Speaking & University
• Cohort <= 1940 & Non French Speaking & < University
• Cohort <= 1940 & French Speaking
• Cohort > 1940 & No Child & University
• Cohort > 1940 & No Child & < University
• Cohort > 1940 & Child & German or Italian Speaking
• Cohort > 1940 & Child & French or Unknown Speaking]

Divorce, Switzerland, Relative risk
[Figure: survival tree grown on relative risk. Root: λ = 1, n = 3619, e = 622. First split on birth cohort (≤ 1940: n = 841, e = 123; > 1940: n = 2778, e = 499), ΔDev = 55.9. Second-level splits on child (Yes vs No/missing, ΔDev = 30.9) and language (Non French vs French, ΔDev = 18.4). Leaf relative risks: λ = 0.48, 1.1, 1.06 and 1.88.]

Hazard model with interaction
Adding the interaction effects detected with the tree approach significantly improves the fit (sig ∆χ2 = 0.004).

                      exp(B)   Sig.
born after 1940       1.78     0.000
university            1.22     0.049
child                 0.94     0.619
language   unknwn     1.50     0.000
           French     1.12     0.282
           German     1        ref
           Italian    0.92     0.677
b_before_40*French    1.46     0.028
b_after_40*child      0.68     0.010
Constant              0.008    0.000

Issues with survival trees in social sciences
1 Dealing with time-varying predictors
• Segal (1992) discusses a few possibilities, none being really satisfactory.
• Huang et al. (1998) propose a piecewise constant approach suitable for discrete variables and a limited number of changes.
• Room for development ...
2 Multi-level analysis
• How can we account for multi-level effects in survival trees, and more generally in trees?
• Conjecture: it should be possible to include an unobserved shared effect in deviance-based splitting criteria.

3 Visualizing and clustering sequence data

Life trajectories: Sequence analysis
• Survival approaches are not useful in a unitary (holistic) perspective of the whole life course.
• Sequence analysis of the whole collection of life events is better suited for such a holistic approach (Billari, 2005).
Rendering sequences: colorize your life courses
• Results from the analysis of the retrospective Swiss Household Panel (SHP) survey.
• Focus on visualization of life course data.

Evolution tendencies in familial life course trajectories
Sequence analysis techniques make it possible to test hypotheses about the evolution of these familial life trajectories (Elzinga and Liefbroer, 2007):
• De-standardization: some states and events of familial life are shared by decreasing proportions of the population, occur at more dispersed ages, and have more scattered durations.
• De-institutionalization: the social and temporal organization of life courses becomes less driven by normative, legal or institutional rules.
• Differentiation: the number of distinct steps lived by an individual increases.

Example: the BioFam sequential data set
Presentation of the "BioFam" data
• Data from the retrospective survey conducted in 2002 by the Swiss Household Panel (SHP) (with support of the Federal Statistical Office, the Swiss National Fund for Scientific Research and the University of Neuchatel).
• Retrospective survey: 5560 individuals.
• Retained familial life events: leaving home, first childbirth, first marriage and first divorce.
• Age 15 to 45 → 2601 remaining individuals, born between 1909 and 1957.

Distribution by birth cohort
[Figure: histogram of birth year (1910 to 1960), frequency 0 to 500.]

Creating state sequences
Example of time-stamped data: individual 1
LHome   marriage   childbirth   divorce
1989    1990       1992         NA

Deriving the states
Need one state for each combination of events:
state   LHome    marriage   childbirth   divorce
0       no       no         no           no
1       yes      no         no           no
2       no       yes        yes/no       no
3       yes      yes        no           no
4       no       no         yes          no
5       yes      no         yes          no
6       yes      yes        yes          no
7       yes/no   yes        yes/no       yes

Characteristics of sequences: Definition
Entropy: measure of uncertainty regarding sequence predictability.
• $p_i$: proportion of cases (or time points) in state $i$.
• Shannon: $h(p) = \sum_i -p_i \log_2(p_i)$
• Other types of entropy: quadratic (Gini), Daroczy, ...
Two ways of using entropies:
• Entropy of the state at each time (age) point: entropy increases with the diversity of states observed at each time point (age).
• Entropy of each individual sequence: entropy increases with the diversity of states during the observed life course and varies with the time spent in each state.
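The two uses of the Shannon entropy defined above, per individual sequence and per time point, can be illustrated with a short computation. This is a minimal sketch on invented toy sequences coded with the state numbers 0 to 7; the function names and the toy data are assumptions, not part of the BioFam analysis.

```python
from collections import Counter
from math import log2


def shannon_entropy(states):
    """Shannon entropy h(p) = sum_i -p_i * log2(p_i) of a list of states."""
    n = len(states)
    counts = Counter(states)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def entropy_by_time_point(sequences):
    """Entropy of the state distribution at each time point.

    sequences: list of equal-length state sequences, one per individual.
    """
    return [shannon_entropy(column) for column in zip(*sequences)]


# Invented toy data: four individuals observed at four ages.
toy_sequences = [
    [0, 0, 1, 3],
    [0, 1, 3, 6],
    [0, 0, 0, 0],
    [0, 1, 1, 3],
]

# Longitudinal view: one entropy per individual sequence.
per_sequence = [shannon_entropy(seq) for seq in toy_sequences]
# Transversal view: one entropy per age.
per_age = entropy_by_time_point(toy_sequences)

print(per_sequence)
print(per_age)
```

An individual who stays in the same state has entropy 0, and the entropy of a sequence depends only on the time spent in each state, not on the order in which states are visited.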

Entropy of the state at each time (age) point
[Figure: entropy of the biofam state distribution by age; entropy (y-axis, ticks 0.2 to 0.8) against age (x-axis, ticks a15 to a29).]

Entropy: Minimum/maximum
[Figure: sequences 1-15, sorted by entropy, showing minimum, median and maximum entropy; x-axis: time, A15 to A45. State legend: N/N/N/N, Y/N/N/N, N/Y/*/N, Y/Y/N/N, N/N/Y/N, Y/N/Y/N, Y/Y/Y/N, */*/*/Y.]

Entropy - histogram
[Figure: histogram of entropy for the sequences in the biofam data set; entropy 0.0 to 1.5, frequency 0 to 500.]

Hypothesis
The evolution of familial life trajectories gives rise to an increase in the entropy of individual sequences, because they become less predictable and more diversified.

Entropy by birth cohorts
[Figure: distribution of sequence entropy (0.0 to 1.5) by birth cohort: 1909-18, 1919-28, 1929-38, 1939-48, 1949-58.]

Entropy by sex
[Figure: distribution of sequence entropy (0.0 to 1.5) by sex: men, women.]

Characteristics of sequences: Definition
Turbulence (Elzinga and Liefbroer, 2007):
• Somewhat similar to entropy.
• Turbulence accounts for state sequencing (which is not the case for entropy).
• Turbulence takes the following two elements into account:
  - the number of subsequences: x = S,U,M,MC (16 subsequences) is more turbulent than y = S,U,S,C (15 subsequences);
  - the variance of the duration in each state: S/10 U/2 M/132 is less turbulent than S/48 U/48 M/48.

Turbulence - Minimum/maximum
[Figure: sequences 1-15, sorted by turbulence, showing minimum, median and maximum turbulence; x-axis: time, A15 to A45. State legend: N/N/N/N, Y/N/N/N, N/Y/*/N, Y/Y/N/N, N/N/Y/N, Y/N/Y/N, Y/Y/Y/N, */*/*/Y.]

Turbulence - histogram
[Figure: histogram of turbulence for the sequences in the biofam data set; turbulence 2 to 10, frequency 0 to 600.]

Turbulence by cohorts
[Figure: distribution of sequence turbulence (2 to 10) by birth cohort: 1909-18, 1919-28, 1929-38, 1939-48, 1949-58.]

Distances between sequences: Clustering

Clustering, multidimensional scaling and more
Once you are able to compute pairwise distances between sequences you can, among other things:
• Cluster sequences.
• Make scatter plot representations of sets of sequences using multidimensional scaling.

Distances between sequences
• Edit distance (known as optimal matching in social sciences) (Levenshtein, 1966; Needleman and Wunsch, 1970; Abbott and Forrest, 1986)
  - d(x, y): total cost of the insertion, deletion and substitution operations required to transform sequence x into y.
  - Different solutions depending on indel and substitution costs.
• Other metrics proposed by Elzinga (2008)
  - LCP: longest common prefix (also longest common postfix).
  - LCS: longest common subsequence (same as OM with indel cost = 1 and substitution cost = 2).
  - NMS: number of matching subsequences.
  - ...
Elzinga (2008) proposes a nice formalization of these metrics.

[Figure: dendrogram of agnes(x = dist.om1, diss = TRUE, method = "ward") computed on the OM1 distances (dist.om1); y-axis: height, 0 to 1200; agglomerative coefficient = 1; leaves labelled with sequence numbers.]
423 2391 274 739 2529 1828 272 1438 2171 321 973 1364 419 507 588 1037 1094 1771 1794 2132 2227 2288 566 1604 1938 2250 49 1179 1537 1804 367 444 517 548 820 1610 2553 307 738 758 971 1119 1266 1305 1412 1572 2411 909 1012 1029 606 789 1456 2252 273 449 516 868 900 927 1157 1354 1588 1608 1638 2021 2090 2300 2329 2410 2441 361 554 766 1047 1357 1414 1415 1792 1852 2051 2231 2244 2296 2368 2387 2494 2528 46 1290 390 1454 2278 278 1411 1723 2500 402 1054 1472 2263 2264 124 558 1142 2089 424 1062 1251 1883 422 1550 2514 2515 608 2293 1636 2328 2480 567 799 2138 1151 1353 346 1701 496 1308 692 1699 604 895 1910 1253 2430 2199 668 2166 377 378 931 1710 1711 2010 1611 855 2490 508 1064 624 2172 1462 993 2208 936 2008 2323 1919 2133 1939 18 1528 2054 21 995 2136 1555 2364 1538 32 2248 263 1240 2189 1831 326 1241 57 2504 84 2066 998 1007 270 987 2170 2037 1470 1948 2505 296 2222 1381 2112 477 1772 1841 2517 85 341 2031 352 1432 2345 1756 2083 93 1729 1769 1001 1974 25 1579 2548 2559 1228 1959 2436 332 1038 1496 2251 626 1061 1877 1322 1924 2507 51 1474 597 1842 2213 2445 2513 476 1774 2369 177 1133 2279 607 2053 353 579 1016 578 1018 827 994 66 1601 613 1605 754 2004 2256 436 1010 492 1180 1255 2567 1595 1903 200 338 2280 303 968 2150 201 1427 1973 2489 997 1439 1367 1665 1996 1668 2527 396 897 976 1449 2111 906 2287 1876 27 1002 1594 43 2319 1247 871 1399 903 1009 2020 141 2537 743 1446 2418 182 1997 1734 2375 254 2427 1839 1006 1923 73 1295 2352 2191 532 1424 1431 107 1989 1681 612 1500 1896 406 791 1086 1223 1960 2044 2197 106 1863 2192 250 339 2351 1227 1942 2290 166 595 28 170 2399 1465 1046 1546 294 475 391 2332 1531 1574 1918 2471 127 1871 628 2481 2076 878 1408 1994 700 1030 2061 522 1371 2210 661 1487 2470 41 1202 381 1340 340 2200 651 1330 1455 103 1670 2442 1331 2350 1413 2357 798 1352 2512 1428 4 671 2338 80 575 1174 1175 1612 1648 2268 2538 538 1311 2235 540 1590 544 2388 2580 412 2073 864 1979 1103 176 1480 1762 2473 2314 246 264 448 1285 1598 2318 2426 689 
1520 1393 1968 741 865 1800 2059 225 961 1471 2078 286 2303 1743 631 2575 1494 647 2079 2230 2341 958 2412 2297 2486 882 2310 1098 1865 1657 2028 2029 5 592 1048 512 515 1628 399 1779 1935 418 2432 688 1013 1540 1613 1172 1173 2260 2405 47 950 1327 2354 618 2195 674 1372 2214 67 2534 557 761 564 1220 1532 1495 1583 2085 64 854 2247 2460 539 1325 1045 293 1280 1324 1440 611 1757 582 1901 2070 2506 614 1477 1478 1988 731 1334 1342 2524 2084 292 697 812 867 1258 351 783 884 2307 1129 1840 2406 1101 1267 1856 2308 525 620 866 885 1036 1497 2194 2216 633 928 1902 1122 1801 2063 634 654 655 1622 2434 805 1621 2123 1277 1396 2464 2511 8 1858 22 320 782 703 1307 1186 1293 1837 2182 2474 44 749 901 2225 609 1163 1954 2492 65 2413 2198 233 974 1104 1300 2013 265 398 930 1507 1822 89 1897 2301 2578 252 762 140 705 1117 2573 329 2130 1309 1904 1108 1851 1651 1846 1111 1159 1160 12 1560 1128 545 708 2035 1022 2363 1549 2180 2594 363 1079 2058 610 1127 1067 1310 350 552 2014 831 1066 1075 1571 2137 2443 2539 2156 659 2160 959 147 268 656 1058 1070 1168 1786 2502 1758 451 615 546 1102 20 732 757 1582 2080 2050 2174 630 387 1674 2140 1530 1961 34 109 694 734 823 990 1170 1213 1655 2119 149 194 287 371 657 1596 1626 1658 2440 1080 1922 193 227 290 393 742 1068 1191 1195 1706 1811 2409 2429 126 2139 1941 2131 1518 2007 2469 135 765 989 2540 288 832 1278 1448 2541 136 181 207 335 719 756 902 1199 1422 1545 1558 1616 1625 1698 1810 1827 2019 2122 2177 2479 2571 942 2532 1323 53 148 262 370 392 636 870 947 953 966 1294 1299 1345 1430 1441 1659 1715 1747 1847 2018 2049 2176 2179 2408 2458 2536 62 108 151 199 366 430 728 1245 1283 1286 1417 1630 1730 1791 1914 1984 2173 2273 2386 2425 2444 63 687 693 949 1027 1118 1370 1523 1525 1603 2104 2402 2509 2545 206 471 605 672 946 948 954 1217 1315 1366 1429 1482 1580 1639 1646 2045 2175 2196 2336 2347 2420 2570 2592 276 1782 175 327 806 1826 1212 261 735 1539 2255 245 629 1183 337 645 574 1077 1908 772 1534 852 2118 2262 2129 1536 2102 6 408 
891 1649 556 1389 1553 717 407 1501 1759 2313 722 1274 1403 1688 1921 1987 2302 2404 122 134 249 1569 1619 222 1687 2462 1359 1834 2431 101 485 551 825 917 1165 1524 1600 1631 1644 244 1052 416 1629 1799 1857 1893 2379 2385 2530 1120 2468 2588 1158 1561 1551 255 257 706 874 2000 318 1643 510 768 1717 680 1514 2209 298 851 972 1635 1817 2232 1891 2025 469 709 2240 572 97 1617 128 204 2238 179 2015 2253 434 2284 2185 438 1272 616 1618 1218 1291 1421 1815 1868 1363 1929 2337 117 1719 165 1215 2572 482 1951 589 1878 984 2201 1460 1814 489 1365 1181 1853 421 542 1911 779 1498 1888 2093 1459 1504 119 1320 1751 343 590 380 1461 686 2407 617 2557 935 1348 2114 342 1930 2498 1521 839 1864 2113 453 520 1909 547 587 690 359 1407 1055 2294 2344 644 941 1298 957 1083 2239 2518 2107 2108 541 1115 667 1963 1854 1962 375 2551 2117 2526 457 458 1547 2562 1262 1445 376 713 506 2519 527 714 280 2086 1222 1781 2039 2040 2224 720 2087 1981 1256 1231 1261 963 1221 1362 2081 2082 1848 1849 7 214 317 704 804 975 1232 1321 1404 1861 2452 2453 301 355 1731 565 1275 650 920 1039 1302 1425 2485 625 808 1499 1586 1724 2265 2463 16 88 410 431 695 698 1176 1447 1486 1570 1650 1913 2229 2246 2419 2472 50 68 1835 1916 52 374 394 809 1492 1889 2046 36 2415 1869 2285 185 435 1097 1292 1319 1704 685 1563 2095 2126 2161 2162 2378 156 1934 2204 190 1682 356 1166 1185 1273 1566 1627 1824 2298 2451 295 511 638 718 822 1137 1203 1205 1529 1593 1845 2056 526 1543 623 744 1090 1761 2228 2598 232 400 486 576 988 1304 1349 1517 1685 1726 1744 2184 2274 2493 859 1556 1005 2346 2062 2152 305 306 1235 2558 11 1178 2283 349 776 1269 824 1225 1074 962 2169 1451 1906 154 529 218 1946 752 1956 1360 1475 237 271 553 1395 79 397 1675 2183 2356 560 2330 1614 1965 96 1161 2583 1405 2249 938 1162 1288 1986 357 1312 922 893 1721 2109 2433 939 1271 1808 2333 174 334 664 1548 1479 1966 1737 461 462 581 1358 2072 2520 847 1303 702 2390 1508 1947 325 1237 1493 2011 2331 1533 2549 9 202 279 308 328 470 745 786 801 1136 1141 
1382 1410 1481 1585 1819 1953 2116 2242 2289 2488 2601 114 1078 2291 33 314 415 459 478 480 573 858 1026 1135 1167 1257 1282 1568 1597 1693 1718 1746 1777 1872 1995 2153 2380 2508 2576 2577 2585 115 331 336 372 771 848 1093 1198 1392 1513 1557 1818 2003 2456 2542 2586 2600 1391 23 24 72 116 188 389 450 639 682 1153 1226 1281 1318 1374 1578 1778 2286 2360 2501 56 632 699 892 899 1189 1328 1503 1770 1775 1776 1796 1975 2149 2383 2384 2461 186 875 2101 1850 2158 2362 2544 1099 1700 2450 19 247 313 368 388 474 479 490 491 495 524 681 725 726 1152 1194 1347 1376 1483 1484 1544 1577 1676 1686 1821 1940 1983 1998 1999 2103 2219 2374 2422 95 123 216 217 411 425 770 1177 1502 1641 1732 1823 2148 158 1335 1802 215 426 721 795 898 1049 1337 1409 2342 39 160 161 315 840 883 1836 1991 2591 1184 1805 683 684 733 1140 1602 1632 1716 1971 1972 2516 800 1106 58 157 178 330 433 509 679 712 785 869 1107 1306 1833 2042 2043 2091 2271 2312 2315 2397 2398 2424 2428 2599 562 1000 1057 2165 14 846 1426 1890 2065 2164 2275 105 1458 978 2266 2334 17 203 501 730 863 1043 1044 1736 1898 1969 138 219 862 1511 1587 1860 1190 69 956 1522 192 211 1035 2168 2277 98 569 1637 1816 437 467 2377 1564 1060 2566 1809 1105 48 269 466 845 964 1725 1742 932 1917 2159 955 1069 1695 1696 1705 2064 2167 2276 2292 120 191 807 2096 2400 2401 409 894 923 643 1453 2565 951 2106 1955 13 2581 504 1109 1881 778 1132 1589 1606 90 1041 1905 621 1375 670 916 753 71 500 673 921 1287 2017 382 2484 1388 198 751 1248 1624 1790 2270 2523 591 2563 727 1880 70 2499 1260 1418 1992 2030 236 723 1063 1567 1623 583 1739 1926 933 118 446 1515 445 125 1932 2069 208 209 383 2181 1131 1931 2403 2564 456 603 1096 180 934 205 1709 2550 666 2561 1708 1114 183 210 2217 2147 1434 2447 1750 2202 691 1270 2186 2187 2491 747 1420 94 1789 1745 1780 724 1505 1607 2254 144 145 1647 1707 251 977 1720 1219 1336 505 514 969 1369 1416 696 811 886 223 1844 1301 1110 1640 2295 1164 1169 2033 369 1516 669 1820 769 1788 1634 2234 1402 373 1236 1714 838 
2141 986 1967 1124 1592 1766 2221 2355 2439 153 497 1259 1576 481 2178 1148 2423 275 2317 447 2041 1765 1230 2304 2395 1346 2324 2376 1749 1748 1855 1912 2121 498 1985 748 1735 658 1565 1980 531 1838 1928 2146 543 1201 844 1197 1207 1450 1200 1084 1882 2188 1095 985 550 904 2449 2022 2416 926 2455 983 1015 1034 1121 1296 0 Mining Event or State Sequences Visualizing and clustering sequence data Distances between sequences: Clustering Dendrogram, OM1 versus OM3 different indel costs (1 vs 3) Dendrogram of agnes(x = dist.om3, diss = TRUE, method = "ward") OM3 dist.om3 Agglomerative Coefficient = 1 Mining Event or State Sequences Visualizing and clustering sequence data Distances between sequences: Clustering State distribution by age, within cluster 0 1 2 3 4 5 6 7 1.6 % 1.7 % 1.8 % 1.0 Groupe 3 1.0 Groupe 2 1.0 Groupe 1 0.8 0.8 2.4 % 0.6 Frequency Frequency 0.6 2.4 % 0.4 0.2 0.4 3.5 % 0.2 0.2 0.4 Frequency 0.6 0.8 2% 0.0 0.0 0.0 4.3 % A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 Age Age A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 Age 4.5 % 1.0 Groupe 6 1.0 Groupe 5 1.0 Groupe 4 0.8 0.8 A21 0.6 A19 A23 A25 A27 A29 0.2 0.4 Age 0.0 0.2 0.4 Frequency A17 Frequency 0.6 A15 0.0 0.0 0.2 0.4 Frequency 0.6 0.8 4.7 % A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 A15 A16 A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 Age Age Age 13/7/2008gr 58/86 Mining Event or State Sequences Visualizing and clustering sequence data Distances 
between sequences: Clustering Most frequent sequences by cluster 0 1 2 3 4 5 6 7 1.6 % Groupe 1 Groupe 2 Groupe 3 1.7 % 1.8 % 5.1 % 2.3 % 6.5 % 2.3 % 6.5 % 2.6 % 6.9 % 2.6 % 8% 2.9 % 1.2 % 1.5 % 1.5 % 1.6 % 3.2 % 8% 1.3 % 2% 2.4 % 2.4 % 1.6 % 3.5 % 8.4 % 8.4 % 4.1 % 9.1 % 4.1 % 11.3 % 5% A15 A22 A29 A36 A43 1.7 % 3.5 % 1.8 % 1.9 % 2.3 % 4.3 % A15 A22 A29 A36 A43 A15 A22 A29 A36 A43 4.5 % Age Age Age Groupe 4 Groupe 5 Groupe 6 4.7 % 0.8 % 0.8 % 0.8 % 0.8 % 0.8 % 0.8 % 1.6 % 1.6 % 3.9 % 1.9 % 1.9 % 3.4 % 0.8 % 0.8 % 4.4 % 0.8 % A15 0.8 % A19 A21 4.8 % 0.8 % A23 A25 Age 4.8 % 0.8 % 7.8 % 0.8 % 57.5 % A17 4.8 % 0.8 % 8.2 % 0.8 % 10.2 % 1.3 % A15 A22 A29 Age 13/7/2008gr 59/86 A36 A43 A15 A22 A29 Age A36 A43 A15 A22 A29 Age A36 A43 A27 Mining Event or State Sequences Visualizing and clustering sequence data Distances between sequences: Clustering I-plot by cluster 0 1 2 3 4 5 6 7 1.6 % 1.7 % 1.8 % 2% 2.4 % 2.4 % 3.5 % 4.3 % 4.5 % 4.7 % A15 A17 A19 A21 A23 Age 13/7/2008gr 60/86 A25 A27 A29 Mining Event or State Sequences Visualizing and clustering sequence data Distances between sequences: Clustering Distribution by birth cohort within each cluster Année de naissance (Groupe 2) Année de naissance (Groupe 3) 300 250 200 150 Frequency 30 Frequency 30 1920 1930 1940 1950 1960 50 1910 1920 1930 1940 1950 1960 1910 1920 1930 1940 1950 année année année Année de naissance (Groupe 4) Année de naissance (Groupe 5) Année de naissance (Groupe 6) 1960 1910 13/7/2008gr 61/86 1920 1930 1940 année 1950 1960 40 30 Frequency 0 0 0 10 10 5 20 20 30 Frequency 10 Frequency 40 15 50 60 50 20 1910 0 0 0 10 10 20 100 20 Frequency 40 40 50 50 60 Année de naissance (Groupe 1) 1910 1920 1930 1940 année 1950 1960 1910 1920 1930 1940 année 1950 1960 Mining Event or State Sequences Visualizing and clustering sequence data Multidimensional Scaling representation of sequences Multidimensional Scaling: Principle Let D be a distance matrix between sequences. D computed using OM, LPS, LCS, ... metrics. 
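As a rough illustration of how a matrix D of OM distances might be obtained, the sketch below computes an optimal matching (edit) distance by dynamic programming. It is a minimal sketch, not the TraMineR implementation: the function names, the constant substitution cost of 2, and the toy state sequences are assumptions made for the example. Only the contrast between indel costs 1 and 3 comes from the OM1 versus OM3 comparison above.

```python
from typing import List, Sequence


def om_distance(a: Sequence[str], b: Sequence[str],
                indel: float = 1.0, sub: float = 2.0) -> float:
    """Optimal matching distance with constant indel and substitution costs."""
    n, m = len(a), len(b)
    # prev[j]: cheapest way to turn the first i-1 states of a into the first j of b
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            keep_or_sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub)
            delete = prev[j] + indel      # drop a[i-1]
            insert = curr[j - 1] + indel  # add b[j-1]
            curr[j] = min(keep_or_sub, delete, insert)
        prev = curr
    return prev[m]


def distance_matrix(seqs: Sequence[Sequence[str]],
                    indel: float = 1.0, sub: float = 2.0) -> List[List[float]]:
    """Pairwise OM distances: the matrix D used for clustering or MDS."""
    return [[om_distance(s, t, indel, sub) for t in seqs] for s in seqs]


# Hypothetical state sequences, one character per year of age.
toy = ["AB", "BA", "AB"]
D_om1 = distance_matrix(toy, indel=1.0)  # cheap indels: shifting states is favoured
D_om3 = distance_matrix(toy, indel=3.0)  # costly indels: substitutions are favoured
```

With cheap indels the two sequences "AB" and "BA" are aligned by one deletion and one insertion; with indel cost 3 two substitutions become cheaper, which is why the choice of indel cost can change the resulting dendrogram.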
Multidimensional Scaling consists in
  finding a set of real-valued variables (f_1, f_2) such that the Euclidean distances
    $\delta_{ij} = \sqrt{(f_{i1} - f_{j1})^2 + (f_{i2} - f_{j2})^2}$
  best approximate the distances $d_{ij}$ between sequences;
  plotting the points in the (f_1, f_2) space.

Multidimensional Scaling
[Figure: scatter plot of the sequences on the two MDS coordinates (vertical axis: dist.om.mds$points[,2]); points marked by cluster, Group 1 to Group 6.]

Mining Frequent Episodes

What can we expect from frequent episodes mining?
  GSP (Srikant and Agrawal, 1996)
  MINEPI, WINEPI (Mannila et al., 1997)
  TCG, TAG (Bettini et al., 1996)
  SPADE (Zaki, 2001)
Are there specific issues when applying these methods in social sciences?

What Is It About?

Frequent episodes: what are they?
  Episode: collection of events occurring frequently together.
  Mining typical episodes: specialized case of mining frequent itemsets.
  Time dimension ⇒ partially ordered events.
  More complex than unordered itemsets. The user must
    specify time constraints (and episode structure constraints);
    select a counting method.

Episode structure constraints
For people who leave home within 2 years of turning 17, what are the typical events occurring until they get married and have a first child?
[Figure: episode structure graph with nodes LH,17, C1, M and an unspecified node (??); node constraints (elastic, w = 1; parallel, w = 2), event constraints, and edge constraints (0, 4), (0, 1, 10), (0, 3).]

Counting methods (Joshi et al., 2001)
Searching (U,C) in the event sequence U U U C C C (time axis 20 to 24), with min gap = 1, max gap = 2, win size = 2:
  individuals with episode               COBJ    = 1
  windows with episode                   CWIN    = 3
  minimal windows with episode           CminWIN = 2
  distinct occurrences                   CDIS_o  = 5
  distinct occurrences without overlap   CDIS    = 3

Example: Counting alternate structures
[Figure: bar chart (COBJ, no max gap); x-axis: pairwise orderings (<, =) of the events Child, Marriage, Job and Educ end; y-axis: 0% to 30%.]
Switzerland, SHP 2002 biographical survey (n = 5560).

Issues Regarding Episode Rules

Rules between episodes
Social scientists like causal explanations. Empirically assessed rules are valuable material in that respect.
Little attention has been paid to this aspect in the literature on frequent subsequences.
Mined episodes are already structured: if (U,C) is a frequent episode, then we know that C often follows U.
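To make the statement "C often follows U" concrete, the following sketch counts the episode (U,C) in two of the ways listed under the counting methods above: per individual (COBJ) and per distinct occurrence with overlap allowed (CDIS_o). It is a simplified reading of those definitions, not the formulation of Joshi et al. The function names and the second individual are hypothetical, and the timestamps (U at 20, 21, 22 and C at 22, 23, 24) are an assumption, since the slide gives the time axis 20 to 24 without the exact position of each event.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (time, event label)


def distinct_occurrences(seq: Iterable[Event], first: str, second: str,
                         min_gap: int, max_gap: int) -> int:
    """Count pairs (first at t1, second at t2) with min_gap <= t2 - t1 <= max_gap."""
    events: List[Event] = list(seq)
    t_first = [t for t, label in events if label == first]
    t_second = [t for t, label in events if label == second]
    return sum(1 for t1 in t_first for t2 in t_second
               if min_gap <= t2 - t1 <= max_gap)


def cobj(sequences: Iterable[Iterable[Event]], first: str, second: str,
         min_gap: int, max_gap: int) -> int:
    """Count individuals whose sequence contains the episode at least once."""
    return sum(1 for seq in sequences
               if distinct_occurrences(seq, first, second, min_gap, max_gap) > 0)


# Assumed timestamps for the slide's example individual.
person = [(20, "U"), (21, "U"), (22, "U"), (22, "C"), (23, "C"), (24, "C")]
# Hypothetical second individual in whom C precedes U, so the episode is absent.
other = [(20, "C"), (23, "U")]

n_pairs = distinct_occurrences(person, "U", "C", min_gap=1, max_gap=2)
n_individuals = cobj([person, other], "U", "C", min_gap=1, max_gap=2)
```

Under these assumed timestamps the example individual yields five distinct (U,C) pairs but counts only once under COBJ, which illustrates how the support of the same episode depends on the counting method.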
Deriving association rules from frequent ordered patterns is similar to what is done with unordered itemsets.
Rule relevance criteria: confidence, surprisingness, implication strength, ... Their value depends on the selected counting method.

Issues with episode rules in social sciences
  Parallel life courses:
    family events and professional life course;
    life courses of each partner of a couple.
  Mining associations between the frequent episodes of a sequence and those of its parallel sequence:
    mine frequent episodes from a mix of the 2 sequences, then restrict the search for rules to candidates whose premise and consequence belong to different sequences; or
    mine frequent episodes from each sequence, then search for rules among candidates obtained by combining frequent episodes from each sequence.
  Accounting for multi-level effects when validating rules: is a rule relevant among groups, or within groups?

Summary
  Data mining approaches (survival trees, clustering sequences, frequent episodes) have a promising future in life course analysis.
  They complement classical statistical outcomes with new insights.
  Their use within social sciences raises specific issues:
    accounting for multi-level effects when growing survival trees or mining association rules;
    handling time-varying predictors in survival trees;
    selecting relevant counting methods (event dependent?);
    finding suitable criteria for measuring association strength between frequent episodes;
    ...

Our TraMineR R-package
Let me finish with an ad ...
TraMineR, a free life trajectory mining tool for the free, open-source R statistical environment, downloadable from http://mephisto.unige.ch/biomining and soon from CRAN.

Thank You!
Appendix

Zoomed tree
[Figure: survival tree, "Divorce, Switzerland, Differences in KM Survival Curves". Splits shown: Birth Cohort (TW(1) = 54.8, p < .0001; ≤ 1940 vs > 1940), Child (TW(1) = 37.4), Language (TW(1) = 22.5; French vs Non French), University (TW(1) = 8.08, p = .0045), Language (TW(1) = 9.77, p = .0018).]

Sub-sequences

Clusters and subsequences
[Figure: panels Group 1 and Group 2; bars labelled with subsequence codes (m1, m5, c1, e1, e5, s1, d1); vertical scale 0 to 1.]

Biofam data: Legend
  no event
  left home
  married with/without child
  left home, married
  with child
  left home, with child
  left home, married, child
  divorced

For Further Reading

Abbott, A. and J. Forrest (1986). Optimal matching methods for historical sequences. Journal of Interdisciplinary History 16, 471–494.

Bettini, C., X. S. Wang, and S. Jajodia (1996). Testing complex temporal relationships involving multiple granularities and its application to data mining (extended abstract). In PODS ’96: Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, New York, pp. 68–78. ACM Press.

Billari, F. C. (2005). Life course analysis: Two (complementary) cultures? Some reflections with examples from the analysis of transition to adulthood. In P. Ghisletta, J.-M. Le Goff, R. Levy, D. Spini, and E. Widmer (Eds.), Towards an Interdisciplinary Perspective on the Life Course, Advancements in Life Course Research, Vol. 10, pp. 267–288. Amsterdam: Elsevier.

Blossfeld, H.-P. and G. Rohwer (2002). Techniques of Event History Modeling, New Approaches to Causal Analysis (2nd ed.). Mahwah NJ: Lawrence Erlbaum.

Elzinga, C. H. (2008). Sequence analysis: Metric representations of categorical time series. Sociological Methods and Research. Forthcoming.

Elzinga, C. H. and A. C. Liefbroer (2007). De-standardization of family-life trajectories of young adults: A cross-national comparison using sequence analysis. European Journal of Population 23, 225–250.

Huang, X., S. Chen, and S. Soong (1998). Piecewise exponential survival trees with time-dependent covariates. Biometrics 54, 1420–1433.

Joshi, M. V., G. Karypis, and V. Kumar (2001). A universal formulation of sequential patterns. In Proceedings of the KDD’2001 workshop on Temporal Data Mining, San Francisco, August 2001.

Leblanc, M. and J. Crowley (1992). Relative risk trees for censored survival data. Biometrics 48, 411–425.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710.

Mannila, H., H. Toivonen, and A. I. Verkamo (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1(3), 259–289.

Needleman, S. and C. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453.

Segal, M. R. (1988). Regression trees for censored data. Biometrics 44, 35–47.

Segal, M. R. (1992). Tree-structured methods for longitudinal data. Journal of the American Statistical Association 87(418), 407–418.

Srikant, R. and R. Agrawal (1996). Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin (Eds.), Advances in Database Technologies – 5th International Conference on Extending Database Technology (EDBT’96), Avignon, France, Volume 1057, pp. 3–17. Springer-Verlag.

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning 42(1/2), 31–60.