Recent issues in data mining

Data Mining: Extracting Knowledge from Past Data
Ming-Syan Chen
Network Database Laboratory
Electrical Engineering Department
National Taiwan University

Outline
• An introduction to data mining
• Challenging issues in data mining

Data Mining
• Data mining: knowledge discovery in databases
  – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  – Relevant fields: AI, databases, statistics
• We are buried in data, but looking for knowledge

Knowledge Discovery Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

Mining Capabilities
• Association
• Classification
• Clustering
• Traversal patterns
• Sequential patterns
• and many others

E.g., Mining Association Rules
• Transaction data analysis: mining association rules
  – Given: (1) a database of transactions, where (2) each transaction has a list of items purchased
• Find all association rules: the presence of one set of items implies the presence of another set of items in the same transaction
• Two primary approaches: (1) Apriori-based, (2) FP-tree-based

Two Parameters
• Confidence (how true the rule is)
  – A confidence of 90% for the rule X & Y => Z means that 90% of the customers who bought X and Y also bought Z
• Support (how useful the rule is)
  – Useful rules should have some minimum transaction support

Applications
• Industry-specific applications, proposed according to the needs of each industry:
  – Finance and insurance: credit rating, customized financial services, credit granting, customer asset management, bad-debt analysis, moral hazard analysis, adverse selection risk analysis, prospective customer list analysis
  – Retail: a basis for real-time purchase decision support, and a decision support system for integrating and placing merchandise, shelf space, and logistics
  – Manufacturing: an expert decision support system for choosing optimal production factors during the production process, plus optimized inventory control and supply chain and customer profitability analysis
  – Chain stores: selecting sites for new stores and product assortments for branch stores, a decision aid for locating logistics warehouses, and a basis for allocating logistics capacity
  – Healthcare: driver analysis for operating cost management, customer profitability analysis, and a source for customized customer services
  – Telecommunications: optimized network traffic allocation and customized services, plus real-time online customized information systems, customized portal sites, and promotion support
  – Biotechnology: an R&D platform and the tools needed for analysis, to accelerate the accumulation of R&D capability
  – Education: analysis of prospective student source lists, data mining to rate admission and scholarship applications, and a basis for students' course and career planning
  – Advertising: ad click source analysis, response rate analysis, and marketing strategy recommendations
  – Non-profit organizations: contact lists and contact methods for fund-raising letters and correspondence

Remarks
• Data mining is very application dependent
  – It calls for a small team with good skills and domain knowledge
• Lots of work has been done in other areas
• Emerging issues:
  – Journals: ACM TODS, ACM TKDD (from 2007), IEEE TKDE, DMKD, KAIS, IS, Pattern Recognition
  – Conferences: SIGKDD, ICDM (from 2001), SIAM-SDM (from 2001), SIGMOD, ICDE, VLDB, CIKM, ICML, SIGIR, WI, PAKDD, etc.

What Is Next for Data Mining
• Privacy-preserving mining
• Data stream mining
• Mining for bioinformatics
• Mining to assist content-based data management

Data Streams: Computation Model
[Figure: Data Streams → Stream Processing Engine (with Synopsis in Memory) → (Approximate) Answer]
• Stream processing requirements
  – Single pass: each record is examined at most once
  – Bounded storage: limited memory for storing the synopsis
  – Real-time: per-record processing time must be low

Outline
• An introduction to data mining
• Challenging issues in data mining

Challenging Issues for Data Mining
• Identifying the data source for the desired knowledge
  – Mining purposes: knowledge or auxiliary metadata
• Data collection methods (on the Web, in wireless environments, in transactions)
  – Different types of data come from different environments
• Usefulness and certainty of mining results
  – Support and confidence
• Interactive mining with different data granularities
  – e.g., generalized association rules
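The support and confidence parameters defined on the "Two Parameters" slide can be sketched in a few lines of Python. This is a minimal illustration rather than either of the approaches the slides name (Apriori or FP-tree): the function names and the five toy transactions are assumptions made up for the example, and each measure is computed by a plain scan of the data.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for tx in transactions if wanted <= tx)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    ante = frozenset(antecedent)
    both = ante | frozenset(consequent)
    ante_count = sum(1 for tx in transactions if ante <= tx)
    both_count = sum(1 for tx in transactions if both <= tx)
    return both_count / ante_count if ante_count else 0.0


# Hypothetical toy database (not taken from the slides).
transactions = [
    frozenset({"X", "Y", "Z"}),
    frozenset({"X", "Y", "Z"}),
    frozenset({"X", "Y"}),
    frozenset({"X", "Z"}),
    frozenset({"Y"}),
]

# Rule X & Y => Z: two of five transactions contain X, Y and Z,
# and two of the three transactions containing X and Y also contain Z.
rule_support = support(transactions, {"X", "Y", "Z"})
rule_confidence = confidence(transactions, {"X", "Y"}, {"Z"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule would be reported only if both values reach user-chosen minimum thresholds.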
Issues (cont'd)
• Mining in data streaming environments
  – Data can be looked at only once, and the amount of data is huge
  – Incremental mining (temporal and spatial)
• Efficiency and scalability of mining algorithms
  – Sampling methods (sampling frequency tuned with respect to the data or to the accuracy of the result)
• Hardware-enhanced mining
  – e.g., PDAs, set-top boxes (STB), devices for location-based services (LBS)

Issues (cont'd)
• Interestingness of mining results
  – One has to know the original likelihood
• Evaluation of mining results
  – How to measure the advantage gained
• Expression of various kinds of mining results
• Protection of privacy and data security
  – Data hiding

Ongoing Work in NetDB Lab
• Web usage mining
• Web content mining
• Mining in mobile environments
• Scalable clustering techniques tuned with domain knowledge
• Incremental mining (temporal and spatial)
• Hardware-enhanced mining

Summary
• Data mining is an area of growing importance
  – Increasing demand for intelligence
  – Fast advances in IT techniques
• Mining will have an increasing impact on Web and wireless applications
  – Huge amounts of digital data
  – Nature of the applications and their users

[Figure: data mining system architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge base (alongside the previous two); Database or data warehouse server; Data cleaning, data integration, and filtering; Database and Data warehouse]

Incremental Mining
• Because of the increasing use of record-based databases, many recent applications call for incremental mining
  – Such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records, to name a few
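Incremental mining over record-based data can be sketched as a window of partitions whose counts are updated as partitions arrive and expire. The sketch below is a simplification under stated assumptions: the class name `SlidingWindowCounts` and the toy partitions are hypothetical, and only single-item counts are maintained, whereas a real incremental miner also has to maintain candidate itemsets.

```python
from collections import Counter, deque
from typing import Deque, List, Set, Tuple


class SlidingWindowCounts:
    """Item counts over the most recent `window_size` partitions."""

    def __init__(self, window_size: int) -> None:
        self.window_size = window_size
        # One (item counts, number of transactions) pair per partition.
        self.partitions: Deque[Tuple[Counter, int]] = deque()
        self.item_counts: Counter = Counter()
        self.num_transactions = 0

    def add_partition(self, transactions: List[Set[str]]) -> None:
        """Add the newest partition and drop the oldest one if the window
        is full, without rescanning the partitions that stay."""
        part = Counter(item for tx in transactions for item in set(tx))
        self.partitions.append((part, len(transactions)))
        self.item_counts += part
        self.num_transactions += len(transactions)
        if len(self.partitions) > self.window_size:
            old_counts, old_size = self.partitions.popleft()
            self.item_counts -= old_counts  # entries that reach zero are dropped
            self.num_transactions -= old_size

    def support(self, item: str) -> float:
        """Support of a single item within the current window."""
        if self.num_transactions == 0:
            return 0.0
        return self.item_counts[item] / self.num_transactions


# Hypothetical partitions; in practice each could hold one month of data.
window = SlidingWindowCounts(window_size=2)
window.add_partition([{"a", "b"}, {"a"}])
window.add_partition([{"b"}])
window.add_partition([{"a", "c"}, {"c"}])  # the first partition expires here
print(dict(window.item_counts), window.num_transactions)
```

Keeping per-partition counts is what makes removal cheap: expiring a partition subtracts its stored counts instead of rereading its records.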
Incremental Mining (cont'd)
• Goal: mine the transaction database over a fixed amount of the most recent data (say, the data of the last 12 months)
• The mining process must not only include the new data (i.e., the data of the new month) but also remove the old data (i.e., the data of the most obsolete month)
[Figure: monthly partitions P_i (data for 1/2000), P_i+1 (data for 2/2000), ..., P_j (data for 12/2000), P_j+1 (data for 1/2001); window db(i, j) covers P_i through P_j, and window db(i+1, j+1) covers P_i+1 through P_j+1]

E.g., Redundant Rules
• For the same support and confidence, if we have the rule {a,d} => {c,e,f,g}, which of these rules do we also have?
  – {a,d} => {c,e,f}
  – {a} => {c,e,f,g}
  – {a,d,c} => {e,f,g}
  – {a} => {d,c,e,f,g}

E.g., Generalized Association Rules
• Which data granularities should be used for data mining?
• The aim is to mine meaningful rules (with proper data units) while being as specific as possible
  – Other mining capabilities face a similar dilemma

Taxonomy:
  Clothes
    Outerwear
      Jackets
      Ski Pants
    Shirts
  Footwear
    Shoes
    Hiking Boots

Database:
  Tx    Items bought
  100   Shirt
  200   Jacket, Hiking Boots
  300   Ski Pants, Hiking Boots
  400   Shoes
  500   Shoes
  600   Jacket

Frequent itemsets:
  Itemset                   Support
  Jacket                    2
  Outerwear                 3
  Clothes                   4
  Shoes                     2
  Hiking Boots              2
  Footwear                  4
  Outerwear, Hiking Boots   2
  Clothes, Hiking Boots     2
  Outerwear, Footwear       2
  Clothes, Footwear         2

Rules (minimum support 30%, minimum confidence 60%):
  Rule                        Support   Confidence
  Outerwear → Hiking Boots    33%       66%
  Outerwear → Footwear        33%       66%
  Hiking Boots → Outerwear    33%       100%
  Hiking Boots → Clothes      33%       100%
However, these more specific rules fall below the minimum support:
  Jacket → Hiking Boots       16%       50%
  Ski Pants → Hiking Boots    16%       100%

E.g., Interestingness of Rules
• In a school of 5000 students
  – 60% (3000) play basketball, 75% (3750) eat cereal, and 40% (2000) do both
• With a minimum support of 2000 and a minimum confidence of 60%, one gets the rule
  – "play basketball => eat cereal"
  – So does that mean promoting basketball activities will help the sales of cereal?
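The generalized association rule example above (the clothes and footwear taxonomy) can be reproduced by extending each transaction with the ancestors of its items before counting. This is a minimal sketch: the parent links are read off the taxonomy as reconstructed from the slide, and the helper names `extend`, `count`, and `rule` are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Child -> parent links of the taxonomy (reconstructed from the slide).
PARENT: Dict[str, str] = {
    "Jacket": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirt": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

# The six transactions from the slide, keyed by transaction id.
TRANSACTIONS: Dict[int, Set[str]] = {
    100: {"Shirt"},
    200: {"Jacket", "Hiking Boots"},
    300: {"Ski Pants", "Hiking Boots"},
    400: {"Shoes"},
    500: {"Shoes"},
    600: {"Jacket"},
}


def extend(tx: Iterable[str]) -> Set[str]:
    """Return the transaction plus every ancestor of each of its items."""
    out = set(tx)
    for item in tx:
        while item in PARENT:
            item = PARENT[item]
            out.add(item)
    return out


EXTENDED: List[Set[str]] = [extend(tx) for tx in TRANSACTIONS.values()]


def count(itemset: Iterable[str]) -> int:
    """Number of extended transactions containing the whole itemset."""
    wanted = set(itemset)
    return sum(1 for tx in EXTENDED if wanted <= tx)


def rule(antecedent: Iterable[str], consequent: Iterable[str]) -> Tuple[float, float]:
    """Return (support, confidence) of antecedent -> consequent."""
    ante = set(antecedent)
    both = count(ante | set(consequent))
    return both / len(EXTENDED), both / count(ante)


for name in ["Jacket", "Outerwear", "Clothes", "Shoes", "Hiking Boots", "Footwear"]:
    print(name, count({name}))

sup, conf = rule({"Outerwear"}, {"Hiking Boots"})
print(f"Outerwear -> Hiking Boots: support={sup:.3f} confidence={conf:.3f}")
```

The exact values for Outerwear → Hiking Boots are 2/6 and 2/3, which the slide shows truncated to 33% and 66%. The same call for Jacket → Hiking Boots gives 1/6 and 1/2, below both thresholds.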
Interestingness (cont'd)
• In fact, P(A and B)/P(A) should be greater than P(B) for the rule "A => B" to be interesting
  – What is the corresponding condition for the rule {A,K} => {B,L,V} to be interesting?

Related Training
• Databases
• AI: machine learning
• Statistics
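The interestingness condition P(A and B)/P(A) > P(B) can be checked against the basketball and cereal figures from the school example. The sketch below is a small worked calculation; the function name `interestingness` is hypothetical, and the ratio it returns is commonly called lift, a term the slides do not use.

```python
from typing import Tuple


def interestingness(n_total: int, n_a: int, n_b: int, n_both: int) -> Tuple[float, float, float]:
    """Return (confidence of A => B, P(B), their ratio).

    The rule passes the condition P(A and B)/P(A) > P(B) exactly when
    the ratio is greater than 1.
    """
    conf = n_both / n_a      # P(A and B) / P(A)
    p_b = n_b / n_total      # P(B)
    return conf, p_b, conf / p_b


# Figures from the school example: 5000 students, 3000 play basketball,
# 3750 eat cereal, 2000 do both.
conf, p_b, ratio = interestingness(5000, 3000, 3750, 2000)
print(f"confidence={conf:.3f}  P(cereal)={p_b:.3f}  ratio={ratio:.3f}")
print("interesting" if ratio > 1 else "not interesting")
```

The rule clears the 60% confidence threshold (2000/3000 is about 66.7%), yet 75% of all students eat cereal, so basketball players are less likely than average to eat it and the ratio falls below 1.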
