Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining: Extracting Knowledge from Past Data Ming-Syan Chen Network Database Laboratory Electrical Engineering Department National Taiwan University Outline • An introduction to data mining • Challenging issues on data mining M.-S. Chen NTU 2 Data Mining • Data mining: Knowledge discovery in databases – extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases – Relevant fields: AI, database, statistics • We are buried in data, but looking for knowledge M.-S. Chen NTU 3 Knowledge Discovering Process Interpretation/ Evaluation Data Mining Knowledge Transformation Patterns Preprocessing Selection ……………… ……………… ……………… Transformed Data Preprocessed Data Data M.-S. Chen Target Data NTU 4 Mining Capabilities • • • • • • Association Classification Clustering Traversal patterns Sequential patterns and many others M.-S. Chen NTU 5 E.g., Mining Association Rules • Transaction data analysis: Mining association rules – Given: (1) a database of transactions (2) each tx has a list of items purchased • Find all asso. rules: the presence of one set of items implies the presence of another set of items in the same tx • Two primary approaches (1) Apriori-Based (2) FP-Tree-Based M.-S. Chen NTU 6 Two Parameters • Confidence (how true) – the rule X&Y => Z has 90% conf. means 90% of customers who bought X and Y also bought Z • Support (how useful the rule is) – useful rules should have some minimum tx support M.-S. Chen NTU 7 Applications • 依據不同產業需求提出產業別應用 金融 保險業 零售業 製造業 信用評 即時輔 生產過 等、客 助購買 程中作 製化金 決策之 為最佳 融服務、 依據, 化生產 授信、 並且提 因素決 客戶之 供貨品、 定之專 資產管 架位、 家輔助 理、壞 物流整 決策系 帳分析、 合及配 統,並 道德危 置之輔 且提供 機分析、 助決策 最佳化 逆向選 支援系 之存貨 擇風險 統 控管與 分析、 供應鏈 潛在客 暨顧客 戶名單 利潤率 分析 分析 M.-S. Chen 連鎖業 醫療業 電信業 生技業 作為展 作業成 提供最 提供研 店店址 本管理 佳化之 發平台 之選擇, 之動因 網路交 以及分 以及分 分析、 通配置, 析所需 店貨品 作為顧 暨、客 工具, 品項選 客利潤 製化服 加速累 擇,並 率分析、 務,並 積研發 且作為 或客戶 且提供 能量 物流倉 客製化 即時之 庫位址 服務之 線上客 決策輔 來源 製化輔 助工具, 助資訊 以及物 系統、 流產能 客製化 輔助配 之入口 置之依 網站及 據 輔助促 銷功能 NTU 教育業 廣告業 非營利 組織 作為潛 廣告點 在學生 閱來源 之來源 分析、 名單分 回應率 析,並 分析、 且運用 行銷策 資訊勘 略提供 測作為 入學申 請暨獎 學金申 請評等 之分析, 及學生 課程規 劃與職 涯規劃 之依據 作為勸 募捐款 信函與 通信之 聯繫名 單方式 8 Remarks • Data mining is very application dependent – Small team with good skill and domain knowledge • Lots of work has been done in other areas • Emerging issues: – Journals, ACM TODS, ACM TKDD (from 2007), IEEE TKDE, DMKD, KAIS, IS, Pattern Recognition – SIGKDD, ICDM (from 2001), SIAM-SDM (from 2001), SIGMOD, ICDE, VLDB, CIKM, ICML, SIGIR, WI, PAKDD, etc. M.-S. Chen NTU 9 What is the Next for Data Mining • • • • Privacy-preserving mining Data stream mining Mining for bioinformatics Mining to assist content-based data management M.-S. Chen NTU 10 Data Streams: Computation Model Synopsis in Memory Data Streams Stream Processing Engine (Approximate) Answer • Stream processing requirements – Single pass: Each record is examined at most once – Bounded storage: Limited Memory for storing synopsis – Real-time: Per record processing time must be low M.-S. Chen NTU 11 Outline • An introduction to data mining • Challenging Issues on data mining M.-S. Chen NTU 12 Challenging Issues for Data Mining • Identifying data source for desired knowledge – Mining purposes: knowledge or auxiliary meta data • Data collection methods (in Web, wireless, tx) – Different types of data from different environment • Usefulness and certainty of mining results – Support and confidence • Interactive mining with different data granularities – e.g., generalized association rules M.-S. Chen NTU 13 Issues (cont’d) • Mining in data streaming environments – Look at data only once; the amount of data is huge – incremental mining (temporal and spatial) • Efficiency and scalability of mining algorithms – Sampling methods (frequency tuned wrt data or wrt result accuracy) • Hardware-enhanced mining – E.g., PDA, STB, devices for LBS M.-S. Chen NTU 14 Issues (cont’d) • Interestingness of mining results – Have to know the original likelihood • Evaluation of mining results – How to measure the advantage gained • Expression of various kinds of mining results • Protection of privacy and data security – Data hiding M.-S. Chen NTU 15 Ongoing Works in NetDB Lab • • • • Web usage mining Web content mining Mining in mobile environments Scalable clustering techniques tuned with domain knowledge • Incremental mining (temporal and spatial) • Hardware-enhanced mining M.-S. Chen NTU 16 Summary • Data mining is an area of growing importance – Increasing demand for intelligence – Fast advance in IT techniques • Mining will be of increasing impact to Web and wireless applications. – Huge amount of digital data – Nature of applications and their users M.-S. Chen NTU 17 Graphical user interface Pattern evaluation Knowledge base Data mining engine Database or data warehouse server Data cleaning Data integration Database M.-S. Chen Filtering Data warehouse NTU 18 Incremental Mining • Due to the increasing use of the recordbased databases, recent important applications have called for the need of incremental mining – Such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records, to name a few M.-S. Chen NTU 19 Incremental Mining • To mine the transaction database for a fixed amount of most recent data (say, data in the last 12 months) • One has to not only include new data (i.e., data in the new month) into, but also remove the old data (i.e., data in the most obsolete month) from the mining process. M.-S. Chen NTU data for 1/2000 Pi data for 2/2000 Pi+1 dbi, j dbi+1, j+1 data for 12/2000 Pj data for 1/2001 Pj+1 20 E.g., Redundant Rules • For the same support and confidence, if we have a rule {a,d}=>{c,e,f,g}, what do we have – – – – {a,d}=>{c,e,f} {a}=>{c,e,f,g} {a,d,c}=>{e,f,g} {a}=>{d,c,e,f,g} M.-S. Chen NTU 21 E.g., Generalized Asso. Rules • Which data granularities should be used for data mining • To mine meaningful rules (proper data units) and be as specific as possible – similar dilemma for other mining capabilities M.-S. Chen NTU 22 Clothes Outerwear Jackets Shirts Hiking Boots Shoes Ski Pants Database Tx 100 200 300 400 500 600 M.-S. Chen Freg. Itemset Footwear Itemset support Jacket Outerwear Clothes Shoes Hiking Boots Footwear Outerwear, Hiking Boots Clothes, Hiking Boots Outerwear, Footwear Clothes, Footwear 2 3 4 2 2 4 2 2 2 2 Items bought Shirt Jacket, Hiking Boots Ski Pants, Hiking Boots Shoes Shoes Jacket sup(30%) conf(60%) Outerwear → Hiking Boots 33% 66% Outerwear → Footwear 33% 66% Hiking Boots → Outwear 33% 100% Hiking Boots → Clothes 33% 100% However, Jacket → Hiking Boots 16% 50% Ski Pants → Hiking Boots 16% 100% NTU 23 E.g., Interestingness of Rules • In a school of 5000 students – 60% (3000) play basketball and 75% (3750) eat cereal; and 40% (2000) do both • Say, minimal sup is 2000 and min conf is 60%, one gets the rule – “play basketball => eat cereal” so ... does that mean promoting the basketball activities will help the sales of cereal? M.-S. Chen NTU 24 Interestingness (Cont’d) • In fact, P(A and B)/P(A) should be greater than P(B) to make the rule “A=>B” be interesting – how about for the rule {A,K,}=>{B,L,V} to be interesting M.-S. Chen NTU 25 Related Training • Database • AI: machine learning • Statistics M.-S. Chen NTU 26