Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Djamel A. Zighed and Nicolas Nicoloyannis ERIC Laboratory University of Lyon 2 (France) [email protected] Prague Sept. 04 About Computer science dep. • In Lyon, there are 3 universities, 100000 students • Lumière university Lyon 2, has 22000 students, • Lyon 2, is mainly a liberal art university • The faculty of economic has tree departments, among them the computer science one • We belong to this department • We have Bachelor, Master and PhD programs for 300 students ERIC Lab at the University Faculties of university of Lyon 2 Economic Research centers of the university Sociology Linguistic Law ERIC Knowledge Engineering Research Center - The budget of ERIC doesn’t depend from the university, it’s given par The national ministry of education - We have a large autonomy in decision making ERIC Lab • • • • • Born in 1995, 11 professors (N. Nicoloyannis, director) 15 PhD Students Grants+contracts+WK+…=200K€/year Research topics – Data mining (theory, tools and applications) – Data warehouse management (T,T,A) Data Mining (T,T,A) • Theory – Induction graphs – Learning and classification • Tools – SIPINA : Plate form for data mining • Applications – – – – Medical fields Chemical applications Human science … Data mining TTA for complex data Data mining on complex data • An example : Breast cancer diagnosis Motivations TXY Association measure : Let X and Y be two attributes Ω be a set of data It measures the strength of the relationship between X and Y X x1 , , x r Y y1 , , y c Contingency table TXY x1 X xr Y y1 n11 yc n1c n1. n r1 n.1 nrc n. c nr. n Motivations TXY Association measure : Let X and Y be two attributes Ω be a set of data It measures the strength of the relationship between X and Y X x1 , , x r Y y1 , , y c Contingency table TXY x1 X xr Y y1 n11 yc n1c n1. n r1 n.1 nrc n. c nr. n Motivations TXY Association measure : Let X and Y be two attributes Ω be a set of data It measures the strength of the relationship between X and Y X x1 , , x r Y y1 , , y c Contingency table TXY x1 X xr Y y1 n11 yc n1c n1. n r1 n.1 nrc n. c nr. n Motivations TXY Association measure : Let X and Y be two attributes Ω be a set of data It measures the strength of the relationship between X and Y X x1 , , x r Y y1 , , y c Contingency table TXY x1 X xr Y y1 n11 yc n1c n1. n r1 n.1 nrc n. c nr. n According to a specific association measure, may we improve the strength of the relationship by merging some rows and/or some columns ? Motivations TXY Association measure : According to a specific association measure, may we improve the strength of the relation ship by merging some rows and/or some columns ? It measures the strength of the relationship between X and Y Contingency table TXY x1 X xr Y y1 n11 T ' XY yc n1c n1. n r1 n.1 nrc n. c nr. n such that : c' c and r ' r and T ' XY TXY An example TXY tˆ 0.14 Goal: Find the groupings that maximize the association between attributes For the preceding example the maximization of the Tschuprow’s t gives T ' XY tˆ' 0.32 tˆ' tˆ Yes, we can improve the association by reducing the size of the contingency table Extension TXY Let X and Y be two attributes Ω be a set of data According to a specific association measure, may we find the optimal reduced contingency table ? X x1 , , x r Y y1 , , y c Contingency table TXY x1 X xr Y y1 n11 yc n1c * TXY n1. n r1 n.1 nrc n. c nr. n c* c l* l * max TXYi TXY i Optimal solution (exhaustive search) Goal : Find the best cross partition on T P TX : The set of all partitions brought about over X P TY : The set of all partitions brought about over Y #P TX : the size of the set P TX #P TY : the size of the set P TY The number of cases we have to check is l #P TX #P TY #P TX nominal case ordinal case Optimal solution (exhaustive search) #P TX nominal case ordinal case Optimal solution (exhaustive search) According to a specific association measure, may we find the optimal reduced contingency table ? Yes, but the solution is intractable in real word because of the high time complexity Heuristic Proceed successively to the grouping of 2 (row or column) values that maximizes the increase in the association criteria. Starting with T 0 Tr ,c the finest categories The algorithm determines successive ly T k k 1,2 Stop when T k T k 1 Complexity Simulation Goal: How far is the quasi-optimal solution from the true optimum? Comparison tractable for tables not greater than 6 × 6. Simulation Design Randomly generate 200 tables Analysis of the distribution of the deviations between optima and quasi-optima. Generating the Tables 10000 cases distributed in the cxr cells of the table with an uniform distribution (worst case). Quasi-optimal solution Quasi-optimal solution Conclusion • Implementation for new approach induction decision tree. – Zighed, D.A., Ritschard, G., W. Erray and V.-M. Scuturici (2003), Abogodaï,a New approach for Decision Trees, in Lavrac, N., D. Gamberger, L. Todorovski and H. Blockeel (eds), Knowledge Discovery in databases: PKDD 2003 , LNAI 2838, Berlin: Springer, 495--506. – Zighed D. A., Ritschard G., Erray W., Scuturici V.-M. (2003), Decision tree with optimal join partitioning, To appear in Journal of Information Intelligent Systems, Kluwer (2004). • Divisive top-down approach • Extension to multidimensionnal case