DATA MINING
Introductory and Advanced Topics
Part I
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Data Mining Outline
PART I
- Introduction
- Related Concepts
- Data Mining Techniques
PART II
- Classification
- Clustering
- Association Rules
PART III
- Web Mining
- Spatial Mining
- Temporal Mining

Ming-Yen Lin, IECS, FCU

Introduction Outline
Goal: Provide an overview of data mining.
- Define data mining
- Data mining vs. databases
- Basic data mining tasks
- Data mining development
- Data mining issues

Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
  - simple listing vs. purchase detail
- How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms
  - Exploratory data analysis
  - Data driven discovery
  - Deductive learning
  - ...

Data Mining: Alternative Names
- Knowledge Discovery in Databases (KDD), pattern mining, knowledge mining, knowledge extraction, data/pattern analysis, information harvesting, data dredging, business intelligence, data archeology
- Chinese-language terms for data mining: 資料探勘, 資料挖掘, 資料採礦, 資料勘測, 知識挖掘
- Data mining: the process of querying and extracting (usually) previously unknown, useful knowledge, patterns, or trends from large amounts of data (stored in databases)

Database Processing vs. Data Mining Processing [Fig. 1.1]
Database processing
- Query: well defined; SQL
- Data: operational data
- Output: precise; subset of database
Data mining processing
- Query: poorly defined; no precise query language
- Data: not operational data
- Output: fuzzy; not a subset of database

Query Examples
Database
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.
Data Mining
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
- [ex. 1.1: D.M. helps to authorize a credit card transaction: 4 classes]

Data Mining Algorithm
Objective: Fit Data to a Model
Characterize D.M. algorithms as 3 parts
- Model
- Preference: criteria to fit the best model
- Search: technique to search the data
[ex. 1.1 illustrated]
Models
- Predictive: make predictions about values of data
- Descriptive: identify patterns/relationships in data [explore the properties of data]

Data Mining Models and Tasks
[Figure: data mining models and tasks; illustrative examples only, not an exhaustive listing]

Predictive Data Mining
- Classification maps data into predefined groups or classes.
  - Supervised learning
  - examples: loan, credit risk
  - Pattern recognition: a type of classification
    - example: airport security screening (face patterns)
- Regression is used to map a data item to a real-valued prediction variable.
  - linear regression, error analysis to find the best fit
- Prediction: predict future data (rather than current data)
  - examples: flooding, speech recognition, …
  - flooding: data collected over time by sensors upriver

Time Series Analysis
- Example: Stock Market
- Predict future values
- Determine similar patterns over time
- Classify behavior: Y[6..20] is similar to Z[13..27]

Descriptive Data Mining
- Clustering groups similar data together into clusters. [vs. classification]
  - Unsupervised learning
  - Segmentation/Partitioning of data
  - example: demographic groups & specialized catalogs
- Summarization maps data into subsets with associated simple descriptions.
  - Characterization/Generalization
- Link Analysis uncovers relationships among data.
  - Affinity Analysis/Associations
  - Association Rules [store example]
- Sequential Analysis (sequence discovery) determines sequential patterns.
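
As a rough illustration of the association-rule query above ("find all items which are frequently purchased with milk"), the following Python sketch counts co-occurrences in a handful of made-up baskets. The transactions, the function name `items_bought_with`, and the 0.5 confidence threshold are assumptions for illustration only; support and confidence are the usual rule measures, not something the slides define at this point.

```python
from collections import Counter

# Hypothetical market-basket data; the item names are illustrative only.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]


def items_bought_with(target, baskets, min_confidence=0.5):
    """Return {item: (support, confidence)} for rules target -> item."""
    with_target = [b for b in baskets if target in b]
    if not with_target:
        return {}
    # Count how often each other item appears in baskets containing the target.
    co_counts = Counter(item for b in with_target for item in b if item != target)
    rules = {}
    for item, count in co_counts.items():
        support = count / len(baskets)         # share of all baskets with both items
        confidence = count / len(with_target)  # share of target baskets that also hold item
        if confidence >= min_confidence:       # assumed threshold, tune per application
            rules[item] = (support, confidence)
    return rules


rules = items_bought_with("milk", transactions)
for item, (support, confidence) in sorted(rules.items()):
    print(f"milk -> {item}: support={support:.2f}, confidence={confidence:.2f}")
```

With these five baskets only bread passes the threshold. A database query, by contrast, would simply list the customers who bought milk.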

Data Mining Functions (I)
- Concept description: characterization and discrimination
  - Generalize, summarize
  - Contrast data characteristics
- Association (correlation and causality)
  - Diaper -> Beer [support 0.5%, confidence 75%]
- Classification and prediction
  - Build models (functions) that describe and distinguish classes or concepts, for use in future prediction
  - e.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict some unknown or missing numerical values

Data Mining Functions (II)
- Cluster analysis
  - Class label is unknown: group data by similarity
  - e.g., cluster houses to find distribution patterns
  - maximizing intra-class similarity, minimizing interclass similarity
- Outlier analysis
  - outlier: a data object that does not comply with the general behavior (model) of the data
  - Noise? Exception? No! Used in fraud detection, rare events analysis
- Trend and evolution analysis
  - trend and deviation: regression analysis
  - sequential pattern mining
  - periodicity analysis
  - similarity-based analysis
- Estimation, Visualization

Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
- Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process
Modified from [FPSS96C]
- Selection: Obtain data from various (heterogeneous) sources.
- Preprocessing: Cleanse (incorrect/missing) data.
- Transformation: Convert to common format; transform to new format; reduce data amount.
- Data Mining: Obtain desired results.
- Interpretation/Evaluation: Present results to user in meaningful manner.

Data Mining: The KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases -> data cleaning and data integration -> Data Warehouse -> selection -> task-relevant data -> Data Mining (core process) -> Pattern Evaluation]

KDD: Knowledge Discovery in Databases
KDD process (interactive and iterative)
- Learning the application domain (relevant prior knowledge & goals of application)
- Steps
  - Data selection: creating a target data set
  - Data cleaning & preprocessing: may take 60% of effort!
  - Data reduction & transformation: find useful features, dimensionality/variable reduction, invariant representation
  - Data mining: choose function (summarization, classification, clustering, regression, association); choose algorithms; search for patterns of interest
  - Pattern evaluation & knowledge presentation: visualization, transformation

KDD Process Ex.: Web Log
- Selection: Select log data (dates and locations) to use
- Preprocessing: Remove identifying URLs; remove error logs
- Transformation: Sessionize (sort and group)
- Data Mining: Identify and count patterns; construct data structure
- Interpretation/Evaluation: Identify and display frequently accessed sequences.
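
The Transformation and Data Mining steps of the Web log example above can be sketched roughly as follows in Python. This is only a minimal sketch: the log records, the 30-minute inactivity timeout, and the helper names `sessionize` and `count_page_pairs` are hypothetical, and counting consecutive page pairs is a deliberately simple stand-in for whatever pattern-counting method is actually used.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned log records: (user, timestamp in seconds, page).
log = [
    ("u1", 0, "/home"), ("u1", 40, "/products"), ("u1", 90, "/cart"),
    ("u2", 10, "/home"), ("u2", 70, "/products"),
    ("u1", 5000, "/home"), ("u1", 5060, "/products"),
]

SESSION_GAP = 1800  # assumed inactivity timeout (30 min); not given in the slides


def sessionize(records, gap=SESSION_GAP):
    """Sort records by user and time, then split each user's visits into sessions."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:  # long pause: close the session, start a new one
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def count_page_pairs(sessions):
    """Count consecutive page pairs within each session."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(log)
pair_counts = count_page_pairs(sessions)
for (src, dst), n in pair_counts.most_common(2):
    print(f"{src} -> {dst}: {n}")
```

Sorting by user and timestamp is the "sort and group" part of sessionizing; the most frequent pairs are what the Interpretation/Evaluation step would display.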
Potential user applications of Web log mining: cache prediction, personalization

Visualization Techniques
- Graphical: bar chart, pie charts, histograms, line graphs
- Geometric: box plot, scatter diagram
- Icon-based: figures, colors
- Pixel-based: unique colored pixel
- Hierarchical
- Hybrid

Data Mining Development [Table 1.1]
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

Data Mining Techniques
[Figure: Data Mining at the center, drawing on Decision Support, Statistics, Machine Learning, Database Management & Warehousing, Parallel Processing, Visualization, Algorithms, and Others]

Evolution of Database Technology
- 1960s (data collection): data collection, database creation, information management systems and network DBMS
- 1970s (databases): relational data model, relational DBMS implementation
- 1980s (advanced databases): RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s (data mining): data mining and data warehousing, multimedia databases, and Web databases

D. M. Implementation Issues
- Human Interaction: domain experts/technical experts
- Overfitting: model does not fit future states
- Outliers
- Interpretation: expert/common users
- Visualization
- Large Datasets
- High Dimensionality

Implementation Issues (cont'd)
- Multimedia Data
- Missing Data
- Irrelevant Data
- Noisy Data
- Changing Data
- Integration: into traditional DBMS
- Application: determine the intended use, business practice

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB & information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data & temporal data
  - Text databases & multimedia databases
  - Heterogeneous & legacy databases
  - WWW

Data Mining Metrics
- Effectiveness/Usefulness measure
- Return on Investment (ROI)
- Accuracy in classification
- Space/Time complexity analysis

Social Implications of DM
- Privacy
- Profiling
- Unauthorized use

Database Perspective on Data Mining
- Scalability
- Real World Data: noisy, missing values
- Updates
- Ease of Use: abstraction of data definition/access primitives, query processing support

Architecture of a Typical Data Mining System
[Figure, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine (with Knowledge-base); Database or data warehouse server; Data cleaning & data integration, Filtering; Databases, Data Warehouse]

The Future
- DMQL (data mining query language)
  - access to concept hierarchy
  - example (p.18)
  - rule_spec: generalized relation/characteristic rule/discriminate rule/classification rule
- KDD process model: CRISP-DM (Cross-Industry Standard Process for Data Mining)
- 5A: assess, access, analyze, act, automate

Reference Websites
- KDD
  - http://www.kdnuggets.com/
  - http://www.acm.org/sigkdd/
  - http://www.acm.org/sigmod/
- Ref. slides
  - http://www.cs.uiuc.edu/~hanj/book
- Research papers
  - http://www.researchindex.com/
  - http://www.google.com/ (p.20)