Knowledge Discovery in Databases
T1: Introduction
P. Berka, 2012

Knowledge Discovery in Databases (Information Harvesting, Data Archeology, Data Mining, Knowledge Distillery, ...)

Non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from data (Fayyad et al., 1996)

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets (Adriaans, Zantinge, 1999)

Analysis of observational data sets to find unsuspected relationships and summarize data in novel ways that are both understandable and useful to the data owner (Hand, Mannila, Smyth, 2001)

Data mining is the process of analyzing hidden patterns of data from different perspectives and categorizing them into useful information (techopedia.org, 2011)

Three sources: databases (query languages, OLAP), statistics (data analysis), artificial intelligence (machine learning)

KDD Tasks (Klosgen, Zytkow, 1997)
  classification/prediction: the task is to find knowledge that can be applied to process new examples automatically
  description: the task is to find the dominant structure or relationships
  search for "nuggets": the task is to find partial, novel and surprising knowledge

KDD Tasks (Chapman et al., 2000)
  data description and summarisation: concise description of the characteristics of the data, typically in elementary and aggregated form
  segmentation: separation of the data into interesting and meaningful subgroups or classes
  concept description: understandable description of concepts or classes to gain insight
  classification: build classification models (sometimes called classifiers) that assign the correct class label to previously unseen, unlabeled objects
  prediction: similar to classification, but the target attribute (class) is continuous rather than a qualitative discrete attribute. Prediction also often deals with time-dependent concepts
  dependency analysis: describe significant dependencies (or associations) between data items or events

Managerial viewpoint
  Managerial problem
  1. Solution team
  2. Problem specification
  3. Data acquisition
  4. Selection of methods
  5. Data preprocessing
  6. Data mining
  7. Interpretation
  Knowledge for the solution

Data processing viewpoint (figure)

Application areas of KDD
  Segmentation and classification (clients of a bank or insurance company)
  Credit risk assessment
  Fraud detection
  Prediction of stock market prices
  Prediction of energy consumption
  Intrusion detection
  Churn analysis (telco service providers, internet providers)
  Microarray data analysis (molecular biology)
  Targeted marketing
  Medical diagnosis
  Market basket analysis

Market basket analysis: data exploration
Collected data: content of market baskets in transactional form

  Basket_id   Item_id
  10011       152
  10011       37
  10012       1
  10012       152
  10012       785
  10012       6
  10013       10
  10014       15
  10014       811
  ...         ...

Market basket analysis: dependency analysis (figure)

Market basket analysis: classification (figure)

KDD Standards

1. Methodologies (Marban et al., 2009)

5A
Developed in the mid-1990s by SPSS.
The name is an acronym for the steps performed:
  Assess: assess the requirements of the project,
  Access: access the available data,
  Analyze: perform the analyses,
  Act: turn knowledge into actions,
  Automate: deploy the models in an automatic way.

SEMMA
Developed in the mid-1990s by SAS:
  Sample the data by creating one or more data tables,
  Explore the data by searching for relationships, trends or anomalies,
  Modify the data by creating, selecting, and transforming the variables,
  Model the relationships between input and output variables by using various data mining techniques,
  Assess the quality of the models.

CRISP-DM
Currently a de facto standard supported by most data mining systems.

2. Standards to describe models

Predictive Model Markup Language (PMML)
An XML-based standard developed by the Data Mining Group (www.dmg.org) for describing data, data transformations and the resulting models.
Main parts of a PMML document:
  Header
  Data Dictionary
  Data Transformations
  Model

<?xml version="1.0" ?>
<PMML version="4.0">
  <Header copyright="P.B." description="An example decision tree model."/>
  <DataDictionary numberOfFields="5">
    <DataField name="income" optype="categorical">
      <Value value="low"/>
      <Value value="high"/>
    </DataField>
    <DataField name="account" optype="categorical">
      <Value value="low"/>
      <Value value="medium"/>
      <Value value="high"/>
    </DataField>
    <DataField name="sex" optype="categorical">
      <Value value="male"/>
      <Value value="female"/>
    </DataField>
    <DataField name="unemployed" optype="categorical">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
    <DataField name="loan" optype="categorical">
      <Value value="A"/>
      <Value value="n"/>
    </DataField>
  </DataDictionary>
  <TreeModel modelName="loan approval decision tree" functionName="classification">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="account"/>
      <MiningField name="sex"/>
      <MiningField name="unemployed"/>
      <MiningField name="loan" usageType="predicted"/>
    </MiningSchema>
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>

3. Programming standards (API)

SQL/MM Data Mining
A standard interface for accessing data mining algorithms from relational databases.

OLE DB for Data Mining
An API developed by Microsoft.

CREATE MINING MODEL CreditRisk (
  CustomerId long key,
  Income text discrete,
  Account text discrete,
  Sex text discrete,
  Unemployed boolean discrete,
  Loan text discrete predict
) USING [Microsoft Decision Tree]

Java Data Mining
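To illustrate what a PMML consumer does with the tree model in the PMML example above, here is a minimal Python sketch that walks the nested Node elements for one record. It is not a PMML engine. The names PMML_TREE, node_matches, score and predict_loan are hypothetical, only the "equal" operator is handled, and the XML namespace is omitted. When no child predicate matches, the sketch returns the score of the last matched node, which corresponds to noTrueChildStrategy="returnLastPrediction" rather than the PMML default of returning a null prediction.

```python
import xml.etree.ElementTree as ET

# The TreeModel nodes from the PMML example, with quoting normalised.
# Embedded here so that the sketch is self-contained.
PMML_TREE = """
<TreeModel modelName="loan approval decision tree" functionName="classification">
  <Node score="A">
    <True/>
    <Node score="A">
      <SimplePredicate field="income" operator="equal" value="high"/>
    </Node>
    <Node score="n">
      <SimplePredicate field="income" operator="equal" value="low"/>
      <Node score="A">
        <SimplePredicate field="account" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="account" operator="equal" value="low"/>
        <Node score="n">
          <SimplePredicate field="unemployed" operator="equal" value="yes"/>
        </Node>
        <Node score="A">
          <SimplePredicate field="unemployed" operator="equal" value="no"/>
        </Node>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""


def node_matches(node, record):
    """Return True if the node's predicate holds for the record."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None or pred.get("operator") != "equal":
        return False  # other predicate types are outside this sketch
    return record.get(pred.get("field")) == pred.get("value")


def score(node, record):
    """Descend into the first matching child; otherwise keep this node's score.

    Assumption: this mirrors noTrueChildStrategy="returnLastPrediction",
    which is not the PMML default.
    """
    for child in node.findall("Node"):
        if node_matches(child, record):
            return score(child, record)
    return node.get("score")


def predict_loan(record):
    """Score one record (a dict of field name to value) against the tree."""
    root = ET.fromstring(PMML_TREE.strip()).find("Node")
    if not node_matches(root, record):
        return None
    return score(root, record)


if __name__ == "__main__":
    applicant = {"income": "low", "account": "low", "unemployed": "no"}
    print(predict_loan(applicant))
```

The tree has no node for account="medium", so the result for such a record depends entirely on the no-true-child strategy: this sketch falls back to the score of the income="low" node.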
Data Mining Systems
  cover the whole KDD process (from data preprocessing to model evaluation),
  offer more data mining algorithms (than single-purpose machine learning systems),
  focus on visualization (both in how the system is used and in how data and results are presented and interpreted).

  System                  Vendor                  URL
  SPM                     Salford Systems         www.salford-systems.com
  Clementine              SPSS                    www-01.ibm.com/software/analytics/spss/products/modeler/
  Enterprise Miner        SAS Institute           www.sas.com/technologies/analytics/datamining/miner/
  GhostMiner              Fujitsu                 www.fqs.pl/business_intelligence/products/ghostminer
  Intelligent Miner       IBM                     www-01.ibm.com/software/data/infosphere/warehouse/enterprise.html
  KnowledgeStudio         Angoss                  www.angoss.com
  Oracle Data Mining      Oracle                  www.oracle.com/us/products/database/options/data-mining/index.html
  PolyAnalyst             Megaputer               www.megaputer.com/
  Statistica Data Miner   StatSoft                www.statsoft.com/products/datamining-solutions/
  LISp Miner              VŠE                     lispminer.vse.cz
  RapidMiner              Rapid-I                 rapid-i.com/
  Weka                    University of Waikato   www.cs.waikato.ac.nz/ml/weka/index.html

Screenshots: Weka, RapidMiner (figures)

Screenshots: SAS Enterprise Miner, IBM SPSS Modeler (Clementine) (figures)
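Returning to the market basket example: the slides name dependency analysis on the transactional data but give no formulas. The Python sketch below computes support and confidence, the usual association-rule measures, which are brought in here as an assumption and not taken from the slides. It uses only the nine visible (Basket_id, Item_id) rows; the rows the slide elides with "..." are not filled in. The names ROWS, to_baskets, support and confidence are hypothetical.

```python
from collections import defaultdict

# Visible rows of the transactional table (Basket_id, Item_id).
# The original table continues with rows that are not shown.
ROWS = [
    (10011, 152), (10011, 37),
    (10012, 1), (10012, 152), (10012, 785), (10012, 6),
    (10013, 10),
    (10014, 15), (10014, 811),
]


def to_baskets(rows):
    """Group transactional rows into a dict of basket id to set of items."""
    baskets = defaultdict(set)
    for basket_id, item_id in rows:
        baskets[basket_id].add(item_id)
    return dict(baskets)


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for contents in baskets.values() if wanted <= contents)
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Support of antecedent and consequent together, divided by antecedent support."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0
    return support(baskets, set(antecedent) | set(consequent)) / base


BASKETS = to_baskets(ROWS)

if __name__ == "__main__":
    print("support({152}) =", support(BASKETS, [152]))
    print("confidence(37 -> 152) =", confidence(BASKETS, [37], [152]))
```

With only four baskets these numbers carry no statistical weight; the sketch shows the computation, not a finding about the data.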