Contributions to MiningMart
Petr Berka
Laboratory for Intelligent Systems
University of Economics, Prague
[email protected]

MiningMart presentation (c) Petr Berka, LISp, 2001

University of Economics, Prague
- LISp - Laboratory for Intelligent Systems
- SALOME - Laboratory for Multidisciplinary Approaches to Decision-making Support in Economics and Management

LISp research
- probabilistic methods - decomposable probability models and Bayesian networks
- symbolic ML methods - 4FT association rules and decision rules
- logical calculi for knowledge discovery in databases

LISp activities
- Organized conferences: ECML'97, PKDD'99
- Organized workshops: Discovery Challenge (PKDD'99, PKDD2000, PKDD2001), WUPES'97, WUPES2000
- International projects: MLNet, Sol-Eu-Net, EUNITE, KDNet, MUM, MGT

SALOME research
- Quantitative and AI (pattern recognition, fuzzy, neural nets) approaches to decision-making support in economics and management

SALOME activities
- Organized workshops: STIPR'97, MME'99
- International projects: Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge

LISp software
- LISp-Miner (data mining system)
  - DataSource (for data manipulation)
  - 4FT-Miner (4FT association rules) and KEX (decision rules)
- experimental software for building graphical models
- preprocessing procedures
  - related to KEX
  - based on an information-theoretic approach

LISp-Miner procedures: DataSource
- creating new (virtual) attributes using SQL
- equidistant and equifrequent discretization
- grouping attribute values
- computing attribute-value frequencies

LISp-Miner procedures: 4FT-Miner and KEX
- 4FT-Miner (GUHA procedure): 4FT association rules in the form Ant ~ Suc / Cond
- KEX: weighted decision rules in the form Ant ⇒ C (weight)

4FT-Miner basic idea
- Generate a (potential) rule, e.g.
    COLOUR(red) ∧ SIZE(small) ⇒_{0.9, 20} TEMP(high)
    AGE(21-30) ∧ SALARY(low) ⇒_{0.85, 15} PAYMENTS(High) LOAN(bad)
- Verify the rule using a four-fold table:

            Suc    ¬Suc
    Ant      a      b
    ¬Ant     c      d

    ⇒_{p,B} is TRUE iff $\frac{a}{a+b} \ge p \wedge a \ge B$
    ⇔_{p,B} is TRUE iff $\frac{a}{a+b+c} \ge p \wedge a \ge B$

KEX basic idea
- Generate a (potential) rule, e.g.
    YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)
- If the rule refines the current set of rules (its validity a/(a+b) differs from the weight inferred during consultation), add it to the rule base with the proper weight.

LISp-Miner architecture
[Diagram labels: MetaData (ODBC, ACCESS); LM Data (ODBC, ACCESS); Windows; Results]

Preprocessing (LISp)
- KEX-oriented
  - (fuzzy) discretization + grouping of values
  - computing the amount of noise in data
  - random sampling + balancing of data
  - handling missing values
- Information theory
  - attribute selection
  - attribute grouping

… fuzzy discretization

    $\frac{N_{Class}(Int)}{N(Int)} \;<>\; \frac{N_{Class}}{N}$

… amount of noise

    head     o  o  o  o  o
    body     r  r  r  r  r
    smile    y  y  y  y  n
    holding  s  s  f  b  s
    jacket   r  r  y  y  r
    tie      y  y  n  n  y
    class    +  +

Amount of noise: 20%; max. possible accuracy = 80%

… data sampling
- random split into training and testing set
- select random stratified sample
- balance unbalanced classes

… handling missing values
- remove the example
- substitute missing with a new value
- substitute missing with the majority value
- proportional substitution

… information theory
- Attribute selection - based on mutual information
- Attribute grouping - based on information content

Preprocessing architecture
[Diagram labels: Input data (ASCII); procedure; Output data (ASCII); Data (ASCII); procedure; Results (ASCII)]

SALOME software
- Feature Selection Toolbox (Multi-Purpose Tool for Pattern Recognition)
  - feature selection
  - approximation-based modeling
  - classification
- a consulting system that helps choose the most suitable method is being developed

Search strategies for FS
Search for a subset maximizing a criterion function (distance, divergence):
- with a priori information
  - exhaustive search
  - branch and bound based algorithms
  - floating search algorithms
- without a priori information
  - approximation method
  - divergence method

FST architecture
[Diagram labels: Data (ASCII); FST; Results; Windows]

References: LISp-Miner
- Berka, P. - Ivanek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In: (Bergadano, deRaedt eds.) Proc. ECML'94, Springer 1994, 339-342.
- Berka, P. - Rauch, J.: Data Mining using GUHA and KEX. In: (Callaos, Yang, Aguilar eds.) 4th Int. Conf. on Information Systems, Analysis and Synthesis ISAS'98, 1998, Vol. 2, 238-244.
- Rauch, J.: Classes of Four Fold Table Quantifiers. In: (Zytkow, Quafafou eds.) Principles of Data Mining and Knowledge Discovery. Springer 1998, 203-211.

References: Preprocessing
- Bruha, I. - Berka, P.: Discretization and Fuzzification of Numerical Attributes in Attribute-Based Learning. In: (Szczepaniak, Lisboa, Kacprzyk eds.) Fuzzy Systems in Medicine, Physica Verlag, 2000, 112-138.
- Pudil, P. - Novovičová, J.: Novel Methods for Subset Selection with Respect to Problem Knowledge. IEEE Transactions on Intelligent Systems - Special Issue on Feature Transformation and Subset Selection, 1998, 66-74.
- Zvarova, J. - Studeny, M.: Information theoretical approach to constitution and reduction of medical data. International Journal of Medical Informatics 45 (1997), n. 1-2, 65-74.
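
Supplementary sketch: 4FT-Miner rule verification

The verification step from the "4FT-Miner basic idea" slide can be illustrated roughly as follows. This is a minimal Python sketch, not LISp-Miner's actual implementation: the names `FourFoldTable`, `build_table`, `founded_implication` and `double_founded_implication` are hypothetical, and the toy data is invented solely to exercise the two conditions a/(a+b) ≥ p ∧ a ≥ B and a/(a+b+c) ≥ p ∧ a ≥ B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFoldTable:
    """Frequencies for a candidate rule Ant ~ Suc (hypothetical helper type)."""
    a: int  # rows satisfying Ant and Suc
    b: int  # rows satisfying Ant but not Suc
    c: int  # rows satisfying Suc but not Ant
    d: int  # rows satisfying neither


def build_table(rows, ant, suc):
    """Count the four frequencies; `ant` and `suc` are predicates on a row."""
    a = b = c = d = 0
    for row in rows:
        has_ant, has_suc = ant(row), suc(row)
        if has_ant and has_suc:
            a += 1
        elif has_ant:
            b += 1
        elif has_suc:
            c += 1
        else:
            d += 1
    return FourFoldTable(a, b, c, d)


def founded_implication(t, p, base):
    """TRUE iff a/(a+b) >= p and a >= base."""
    if t.a + t.b == 0:
        return False  # no row satisfies Ant, so the rule has no support
    return t.a / (t.a + t.b) >= p and t.a >= base


def double_founded_implication(t, p, base):
    """TRUE iff a/(a+b+c) >= p and a >= base."""
    denominator = t.a + t.b + t.c
    if denominator == 0:
        return False
    return t.a / denominator >= p and t.a >= base


# Invented toy data (assumption, not from the slides): 37 rows in total.
rows = (
    [{"colour": "red", "size": "small", "temp": "high"}] * 20
    + [{"colour": "red", "size": "small", "temp": "low"}] * 2
    + [{"colour": "blue", "size": "small", "temp": "high"}] * 5
    + [{"colour": "blue", "size": "large", "temp": "low"}] * 10
)

# Candidate rule: COLOUR(red) and SIZE(small) => TEMP(high), with p = 0.9, B = 20.
table = build_table(
    rows,
    ant=lambda r: r["colour"] == "red" and r["size"] == "small",
    suc=lambda r: r["temp"] == "high",
)
print(table)
print(founded_implication(table, p=0.9, base=20))
print(double_founded_implication(table, p=0.9, base=20))
```

On this toy data the table is a = 20, b = 2, c = 5, d = 10, so the first condition holds (20/22 ≥ 0.9 and 20 ≥ 20) while the second does not (20/27 < 0.9). Passing the antecedent and succedent as predicates keeps the counting step independent of how the attributes were discretized or grouped.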