Telecommunications Technology Association of Korea (TTA), Data Technology Committee, 2001 Workshop
SQL Standard for Data Mining
Oct. 2001
Hwan-Seung Yong, Ewha Womans Univ.
http://dblab.ewha.ac.kr/hsyong

Data Mining Standard by ISO/IEC
• Information Technology – Database Languages – SQL Multimedia and Application Packages – Part 6: Data Mining
• FCD 13249-6, 2001-05-21
  – Vote deadline: 2001-10-05
• Other parts
  – Framework
  – Full-Text
  – Spatial
  – Still Image

Data Mining
• Association rule
  – Given a set of purchase transactions (baskets), each of which contains a set of items.
  – Find rules of the form: if a purchase transaction contains item X and item Y, then it also contains item Z, in N% of all purchase transactions.
  – Example application: store layout.
• Clustering/Segmentation
  – Given a set of rows with a set of fields, find sets of rows with common characteristics, the so-called clusters.
  – Example application: customer mailings.
• Classification
  – Given a set of rows with a set of fields and a special field, the so-called class label, compute a classification model such that the class label can be predicted from the model and a set of field values that does not include the class label.
  – Example application: insurance risk prediction.
• Regression
  – Regression is very similar to classification except for the type of the predicted value: rather than a class label, regression predicts a continuous value.
  – Example application: customer ranking.

Computational Patterns
• Training Phase
  – Common to all data mining techniques; this is the phase in which the data mining model is computed.
• Application Phase
  – The phase in which a row is evaluated against a data mining model and one or more values are computed.
• Test Phase
  – The phase that reads a set of rows containing values for the target field, evaluates each row as in the application phase, and compares the predicted value to the actual value in the target field.
  – The test phase is used only for classification and regression.

Standard Language
• Defined through a set of user-defined types (UDTs)
  – A specification only, not an implementation
• Data Mining Model Types
  – Storage and retrieval of data mining model values
• Data Mining Setting Types
  – Define a target field and parameters for algorithms
• Data Mining Application Result Types
  – The result of applying a mining model to a row
• Data Mining Data Types
• Status Code

Data Mining Model Types
• Type: DM_RuleModel
  – Holds the result of association rule mining
• Methods
  – DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength))
    • Imports a rule model expressed in PMML
    • Returns a DM_RuleModel
  – DM_expRuleModel(): exports the rule model as PMML
  – DM_getNORules(): returns the number of rules
  – DM_getRuleTask(): returns the data mining task value
    • Data mining settings, etc.

Data Mining Settings Types
• Set the mining parameters
• Type: DM_RuleSettings
• Methods
  – setMinSupport(DOUBLE PRECISION)
  – getMinSupport()
  – DM_ruleUseDataSpec(DM_LogicalDataSpec)
    • A DM_LogicalDataSpec is an abstraction of the source table for the input data
  – DM_ruleGetDataSpec()
  – DM_ruleSetGroup(CHARACTER VARYING)
    • Identifies the grouping field for association mining
    • Example: the purchase transaction
  – DM_ruleGetGroup()

Data Mining Application Result Types
• The result of applying a mining model to a row
• Type: DM_ClusResult
• Methods
  – DM_getClusterID()
    • Returns the cluster identification number
  – DM_getQuality()
    • Returns the degree of fit to the predicted cluster
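
To make the application phase for clustering more concrete, the following Python sketch assigns one row to the nearest of a few cluster centroids and returns a small result object. It is only an illustration: the class ClusResult, the function apply_clustering, the centroid values, and the quality formula 1 / (1 + distance) are all assumptions made for this example and are not taken from the standard.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class ClusResult:
    """Hypothetical stand-in for a DM_ClusResult value."""
    cluster_id: int   # the kind of value DM_getClusterID() would return
    quality: float    # the kind of value DM_getQuality() would return


def apply_clustering(centroids, row):
    """Application phase: evaluate one row against a clustering model.

    The model here is just a list of centroids. The quality formula
    1 / (1 + distance) is an assumption made for this sketch.
    """
    distances = [dist(c, row) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return ClusResult(cluster_id=best + 1, quality=1.0 / (1.0 + distances[best]))


# Illustrative model with two clusters; no training phase is shown here.
centroids = [(0.0, 0.0), (10.0, 10.0)]
result = apply_clustering(centroids, (9.0, 10.0))
print(result.cluster_id, result.quality)
```

The point of the sketch is the shape of the result: applying a model to a single row yields both a cluster identifier and a measure of how well the row fits that cluster.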
Data Mining Data Types
• Represent the input data needed for data mining
• DM_LogicalDataSpec type and routines
  – Abstraction of the set of fields used for mining
• Methods
  – DM_addDataSpecFld(CHARACTER VARYING)
  – DM_remDataSpecFld(CHARACTER VARYING)
  – DM_getNOFields()
  – DM_getFieldName(INTEGER)
  – DM_setFieldType(CHARACTER VARYING, SMALLINT)
    • Two kinds of field type: categorical and numeric
  – DM_getFieldType(CHARACTER VARYING)
  – DM_compatibleSpec(DM_LogicalDataSpec)

Data Mining Data Types
• DM_MiningData type and routines
  – Abstraction of the input data for mining
• Methods
  – DM_defMiningData(CHARACTER VARYING)
    • Takes a source table as input and defines it as mining data
  – DM_defFldAlias(CHARACTER VARYING, CHARACTER VARYING)
    • Defines a field name alias
  – DM_genLogDataSpec()
    • Generates a value of type DM_LogicalDataSpec

PMML
• Predictive Model Markup Language
  – Makes it easy to define predictive models and to share them between companies
  – XML based
  – Driven by the Data Mining Group
    • www.dmg.org
    • Consortium of major data mining vendors
  – Current version: 2.0
  – PMML 2.0 was presented at the 2nd workshop on PMML, held on Aug 26, 2001

DTD of Association Rules Model
  <!ELEMENT AssociationModel (Extension*, AssocInputStats, AssocItem+, AssocItemset+, AssocRule+)>
  <!ATTLIST AssociationModel
      modelName               CDATA            #IMPLIED >
  <!ELEMENT AssocInputStats EMPTY>
  <!ATTLIST AssocInputStats
      numberOfTransactions    %INT-NUMBER;     #REQUIRED
      maxNumberOfItemsPerTA   %INT-NUMBER;     #IMPLIED
      avgNumberOfItemsPerTA   %REAL-NUMBER;    #IMPLIED
      minimumSupport          %PROB-NUMBER;    #REQUIRED
      minimumConfidence       %PROB-NUMBER;    #REQUIRED
      lengthLimit             %INT-NUMBER;     #IMPLIED
      numberOfItems           %INT-NUMBER;     #REQUIRED
      numberOfItemsets        %INT-NUMBER;     #REQUIRED
      numberOfRules           %INT-NUMBER;     #REQUIRED >

PMML Example: Association Rule 1/2
• t1: Cracker, Coke, Water
• t2: Cracker, Water
• t3: Cracker, Water
• t4: Cracker, Coke, Water
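
The four transactions t1 to t4 can be mined by brute force to see where the statistics in a PMML association model come from. The Python sketch below is an illustration only, not the algorithm of any particular product: it enumerates every candidate itemset, which is feasible for three items but not for real data. The thresholds 0.6 (minimum support) and 0.5 (minimum confidence) are the ones used in the PMML example; all variable and function names are assumptions of this sketch.

```python
from itertools import combinations

transactions = [
    {"Cracker", "Coke", "Water"},   # t1
    {"Cracker", "Water"},           # t2
    {"Cracker", "Water"},           # t3
    {"Cracker", "Coke", "Water"},   # t4
]
MIN_SUPPORT = 0.6
MIN_CONFIDENCE = 0.5


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


items = sorted(set().union(*transactions))

# Enumerate every candidate itemset and keep the frequent ones.
# Real miners prune the search; brute force is enough for three items.
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= MIN_SUPPORT:
            frequent[frozenset(combo)] = s

# Derive rules "antecedent -> consequent" from frequent itemsets of size >= 2.
rules = []
for itemset, s in frequent.items():
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            a = frozenset(antecedent)
            confidence = s / support(a)
            if confidence >= MIN_CONFIDENCE:
                rules.append((sorted(a), sorted(itemset - a), s, confidence))

for itemset, s in frequent.items():
    print(sorted(itemset), s)
for rule in rules:
    print(rule)
```

Coke appears in only two of the four transactions (support 0.5), so it falls below the minimum support. That leaves three frequent itemsets and two rules, matching numberOfItemsets="3" and numberOfRules="2" in the example model.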
  <?xml version="1.0" ?>
  <PMML version="1.1">
    <Header copyright="www.dmg.org" description="example model for association rules"/>
    <DataDictionary numberOfFields="1">
      <DataField name="item" optype="categorical"/>
    </DataDictionary>
    <AssociationModel>
      <AssocInputStats numberOfTransactions="4" numberOfItems="3"
          minimumSupport="0.6" minimumConfidence="0.5"
          numberOfItemsets="3" numberOfRules="2"/>
      <!-- We have three items in our input data -->
      <AssocItem id="1" value="Cracker"/>
      <AssocItem id="2" value="Coke"/>
      <AssocItem id="3" value="Water"/>
      <!-- and two frequent itemsets with a single item -->

PMML Example: Association Rule 2/2
      <AssocItemset id="1" support="1.0" numberOfItems="1">
        <AssocItemRef itemRef="1"/>
      </AssocItemset>
      <AssocItemset id="2" support="1.0" numberOfItems="1">
        <AssocItemRef itemRef="3"/>
      </AssocItemset>
      <!-- and one frequent itemset with two items. -->
      <AssocItemset id="3" support="1.0" numberOfItems="2">
        <AssocItemRef itemRef="1"/>
        <AssocItemRef itemRef="3"/>
      </AssocItemset>
      <!-- Two rules satisfy the requirements -->
      <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
      <AssocRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
    </AssociationModel>
  </PMML>

Final Remarks
• Data mining is a hot and promising area, like DBMS
• Standards activity
  – The SQL data mining standard is ready
  – The PMML standard for exchanging mining results is ready
  – But no software implementing them is available yet
• Further research and standardization areas
  – SQL for multimedia data mining
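
As a sketch of how an application might consume a PMML association model such as the Cracker/Water example, the Python code below parses a well-formed copy of that model with the standard library module xml.etree.ElementTree and resolves each rule's itemset references back to item names. The Header and DataDictionary elements are omitted for brevity, and the variable names are assumptions of this sketch. It is not a general PMML reader.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the example model (Header and DataDictionary omitted).
PMML_TEXT = """<?xml version="1.0"?>
<PMML version="1.1">
  <AssociationModel>
    <AssocInputStats numberOfTransactions="4" numberOfItems="3"
        minimumSupport="0.6" minimumConfidence="0.5"
        numberOfItemsets="3" numberOfRules="2"/>
    <AssocItem id="1" value="Cracker"/>
    <AssocItem id="2" value="Coke"/>
    <AssocItem id="3" value="Water"/>
    <AssocItemset id="1" support="1.0" numberOfItems="1">
      <AssocItemRef itemRef="1"/>
    </AssocItemset>
    <AssocItemset id="2" support="1.0" numberOfItems="1">
      <AssocItemRef itemRef="3"/>
    </AssocItemset>
    <AssocItemset id="3" support="1.0" numberOfItems="2">
      <AssocItemRef itemRef="1"/>
      <AssocItemRef itemRef="3"/>
    </AssocItemset>
    <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
    <AssocRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
  </AssociationModel>
</PMML>"""

root = ET.fromstring(PMML_TEXT)
model = root.find("AssociationModel")

# Map item ids to item names, then itemset ids to lists of item names.
items = {e.get("id"): e.get("value") for e in model.findall("AssocItem")}
itemsets = {
    s.get("id"): [items[r.get("itemRef")] for r in s.findall("AssocItemRef")]
    for s in model.findall("AssocItemset")
}

# In this model, antecedent and consequent refer to itemset ids.
rules = [
    (itemsets[r.get("antecedent")], itemsets[r.get("consequent")],
     float(r.get("support")), float(r.get("confidence")))
    for r in model.findall("AssocRule")
]

for antecedent, consequent, sup, conf in rules:
    print(antecedent, "->", consequent, sup, conf)
```

Resolving the references shows that the two rules read "Cracker implies Water" and "Water implies Cracker", each with support 1.0 and confidence 1.0.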