Related Concepts Outline
Goal: Examine some areas that are related to data mining.
- Database/OLTP Systems
- Fuzzy Sets and Logic
- Information Retrieval (Web Search Engines)
- Dimensional Modeling
- Data Warehousing
- OLAP/DSS
- Statistics
- Machine Learning
- Pattern Matching
Ming-Yen Lin, IECS, FCU

DB & OLTP Systems
- OLTP: On-Line Transaction Processing
- Schema: (ID, Name, Address, Salary, JobNo)
- Data model: Entity-Relationship, Relational
- Transaction
- Query: SELECT Name FROM T WHERE Salary > 100000 [Fig. 2.1]
- DM: Only imprecise queries

Fuzzy Sets and Logic
- Fuzzy set: the set membership function is a real-valued function with output in the range [0,1].
  - f(x): probability that x is in F.
  - 1 - f(x): probability that x is not in F.
- Example: T = {x | x is a person and x is tall}
  - Let f(x) be the probability that x is tall; here f is the membership function.
- {x | x ∈ R and x.salary > 100,000} vs. {x | x ∈ R and x is tall}
- DM: Prediction and classification are fuzzy.

Fuzzy Sets & Fuzzy Logic
- Fuzzy logic: reasoning with uncertainty; multiple-valued logic
- Retrieve data with imprecise/missing values
- mem(¬x) = 1 - mem(x)
- mem(x ∧ y) = min(mem(x), mem(y))
- mem(x ∨ y) = max(mem(x), mem(y))

Classification/Prediction is Fuzzy
[Figure: accept/reject decision against loan amount, in two panels. Simple: a sharp boundary between Reject and Accept. Fuzzy: a grey area between Reject and Accept.]

Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
  - Library Science
  - Digital Libraries
  - Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about “data mining”.
- DM: Similarity measures; mine text/Web data.

Information Retrieval (cont’d)
- Similarity: a measure of how close a query is to a document.
- Documents which are “close enough” are retrieved.
- sim(q, Di); sim(Di, Dj)
- Metrics:
  - Precision = |Relevant and Retrieved| / |Retrieved|
  - Recall = |Relevant and Retrieved| / |Relevant|
- Inverse Document Frequency: IDFk = log(n / |documents containing k|) + 1, where n is the total number of documents
- Concept hierarchy [Fig.
2.7]: replace ‘tiger’ with ‘CAT’; the hierarchy may be a directed acyclic graph.

IR Query Result Measures and Classification
[Figure: query result outcomes for IR and for classification, used to calculate precision/recall.]

Decision Support Systems
- Improve decision making by providing specific information needed by management
- Executive Information Systems
- Executive Support Systems
- As a suite of tools, assist in the overall DSS process

Dimensional Modeling
- A different way to view and interrogate data in a DB
- View data in a hierarchical manner, more as business executives might
- Useful in decision support systems and mining
- Dimension: collection of logically related attributes; an axis for modeling data.
- Facts: the data stored
- Example: Dimensions: products, locations, date. Facts: quantity, unit price.
- DM: May view data as dimensional.

Relational View of Data

  ProdID  LocID       Date    Quantity  UnitPrice
  123     Dallas      022900         5         25
  123     Houston     020100        10         20
  150     Dallas      031500         1        100
  150     Dallas      031500         5         95
  150     Fort Worth  021000         5         80
  150     Chicago     012000        20         75
  200     Seattle     030100         5         50
  300     Rochester   021500       200          5
  500     Bradenton   022000        15         20
  500     Chicago     012000        10         25

Dimensional Modeling Queries
- Roll Up: more general dimension
- Drill Down: more specific dimension
- Dimension (Aggregation) Hierarchy
- SQL uses aggregation
- Multidimensional schemas
  - star schema
  - snowflake schema
  - fact constellation schema
- Multidimensional indexing
  - bitmap index, join index

Cube View of Data
[Figure: cube view of data.]

Aggregation Hierarchies
- Order relationship: second < minute
- Aggregate: sum
- Additive

Star Schema
[Figure: a central Sales fact table surrounded by the dimension tables Day, Product, Division, and Location.]
- Aggregate facts for efficiency

Example of Star Schema
  time: time_key, day, day_of_the_week, month, quarter, year
  branch: branch_key, branch_name, branch_type
  Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold,
avg_sales
  item: item_key, item_name, brand, type, supplier_type
  location: location_key, street, city, province_or_state, country
  Measures: units_sold, dollars_sold, avg_sales

Options to Implement Star Schema [Fig. 2.12]
(a) Flattened: store the data for each dimension in exactly one table; roll up by SQL aggregation.
(b) Normalized: a table exists for each level in each dimension; each table has one tuple for every occurrence at that level.
(c) Expanded: the number of dimension tables is the same as in the normalized option; the table for the lowest level is the same as in the flattened option.
(d) Levelized: has one dimension table, as does the flattened option, but aggregations have already been performed.

Example of Snowflake Schema
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, province_or_state, country
  Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
  Measures: units_sold, dollars_sold, avg_sales

Example of Fact Constellation (Galaxy Schema)
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type
  Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
  Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
  Measures: units_sold, dollars_sold, avg_sales, dollars_cost, units_shipped

Data Warehousing
- “Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
- Operational data: data used in the day-to-day needs of the company.
- Informational data: supports other functions such as planning and forecasting.
- Data mining tools often access data warehouses rather than operational data.
- DM: May access data in the warehouse.

What is a Data Warehouse?
- Definition
  - A decision support database that is set up separately, independent of the company's operational databases
  - Supports information processing by providing a solid platform of consolidated historical data for analysis
  - “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
- Data warehousing
  - The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
- Organized around major subjects, such as customer, product, sales
- Focuses on the data modeling and analysis that decision makers need, not on daily operations or transaction processing
- Excludes data that is not useful in the decision support process, giving a simple and concise view around a particular subject

Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources
  - relational databases
  - flat files
  - on-line transaction records
- Data cleaning and data integration techniques are applied
- Ensures consistency among the different data sources in
  - naming conventions
  - encoding structures
  - attribute measures
  - Example: hotel price: currency, tax, breakfast covered, etc.
- Data has already been converted by the time it is "moved" into the warehouse

Data Warehouse: Time-Variant
- The time horizon of a data warehouse is significantly longer than that of an operational system
  - Operational database: current value data.
  - Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years)
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly
- Operational data may or may not contain a “time element”

Data Warehouse: Non-Volatile
- A physically separate store of data transformed from the operational environment
- Operational updates do not occur in the data warehouse
  - No transaction processing, recovery, or concurrency control mechanisms are required
  - Only two operations are required
    - initial loading of data
    - access of data

Data Warehousing
- Traditional DB: operational data
- Data warehouse: informational data
- ‘What if’ questions -> warehouse + query
  - e.g., analyze trends from historical data
- Basic components
  - data migration
  - warehouse
  - access tools

Transformation in Data Warehousing
- Transformation [Fig. 2.14]
  - remove unwanted data
  - convert heterogeneous sources into one common format
  - merge snapshots to create a historical view
  - summarize data at levels
  - add derived data
  - handle missing/erroneous data
  - also called data scrubbing/data staging
- Improve performance of data warehouse applications
  - Summarization
  - Denormalization (speeds up joins)
  - Partitioning

Operational vs.
Informational

                Operational Data    Data Warehouse
  Application   OLTP                OLAP
  Use           Precise queries     Ad hoc
  Temporal      Snapshot            Historical
  Modification  Dynamic             Static
  Orientation   Application         Business
  Data          Operational values  Integrated
  Size          Gigabits            Terabits
  Level         Detailed            Summarized
  Access        Often               Less often
  Response      Few seconds         Minutes
  Data schema   Relational          Star/Snowflake

OLAP
- Online Analytic Processing (OLAP): provides more complex queries than OLTP.
- OnLine Transaction Processing (OLTP): traditional database/transaction processing.
- Dimensional data; cube view
- Visualization of operations:
  - Slice: examine a sub-cube.
  - Dice: rotate the cube to look at another dimension.
  - Roll Up/Drill Down
- DM: May use OLAP queries.

A Concept Hierarchy: Dimension (location)
[Figure: location hierarchy with the levels all > region > country > city > office. Example values: all; Europe, North_America; Germany, Spain, Canada, Mexico; Frankfurt, Vancouver, Toronto; L. Chan, M. Wind.]
- Used for multi-level abstraction (for interactive mining)

Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up the hierarchy or by dimension reduction
- Drill down (roll down): the reverse of roll-up
  - from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
- Slice and dice: select a part of the cube
  - project and select
- Pivot (rotate):
  - reorient the cube for visualization, e.g. 3D to a series of 2D planes.
- Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

Cube Operations
[Figure: cube operations, labelled as follows.]
- dice (location = X AND time = Y AND item = Z)
- roll-up (city to location)
- drill-down (quarter to month)
- slice (time = Q1)
- pivot

OLAP Operations
[Figure: Roll Up and Drill Down between views of a cube: Single Cell, Multiple Cells, Slice, Dice.]
- OLAP tools: ROLAP (relational) or MOLAP (multidimensional)
  - ROLAP: a ROLAP server (middleware) creates the multidimensional view for users
  - MOLAP: a specialized DBMS and software directly support multidimensional data
  - or a hybrid tool

Web Search Engines
- Can be viewed as query systems, like IR systems
  - query: keyword, boolean, weighted, …
- Conventional search engines suffer from
  - Abundance
  - Limited coverage
  - Limited query
  - Limited customization
- Web Mining
  - content/structure/usage
  - Web search => content mining

Statistics
- Simple descriptive models
- Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
- Exploratory Data Analysis:
  - Data can actually drive the creation of the model
  - Opposite of the traditional statistical view.
- Data mining is targeted to the business user
- DM: Many data mining methods come from statistical techniques.

Machine Learning
- Machine Learning: the area of AI that examines how to write programs that can learn.
- Often used in classification and prediction
- Supervised Learning: learns by example.
- Unsupervised Learning: learns without knowledge of correct answers.
- Machine learning often deals with small static datasets. [Table 2.3]
- DM: Uses many machine learning techniques.

Pattern Matching (Recognition)
- Pattern Matching: finds occurrences of a predefined pattern in the data.
- Applications include speech recognition, information retrieval, time series analysis.
- DM: Type of classification.

DM vs.
Related Topics

  Area     Query     Data              Results  Output
  DB/OLTP  Precise   Database          Precise  DB objects or aggregation
  IR       Precise   Documents         Vague    Documents
  OLAP     Analysis  Multidimensional  Precise  DB objects or aggregation
  DM       Vague     Preprocessed      Vague    KDD objects
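
The contrast between the first three areas in this comparison can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original slides. The employee rows, the document IDs, and the helper names (precise_query, precision_recall, roll_up_quantity) are hypothetical. The query text and the schema come from the DB & OLTP slide, and the sales rows follow the Relational View of Data slide as best it can be read.

```python
import sqlite3
from collections import defaultdict

# Hypothetical employee rows for the schema (ID, Name, Address, Salary, JobNo).
EMPLOYEES = [
    (1, "Ann", "Dallas", 120000, 10),
    (2, "Bob", "Houston", 90000, 20),
    (3, "Cho", "Chicago", 150000, 10),
]

# (ProdID, LocID, Date, Quantity, UnitPrice), as read from the slide's table.
SALES = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]


def precise_query():
    """DB/OLTP: a precise query returns an exact set of database objects."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, "
        "Salary INTEGER, JobNo INTEGER)"
    )
    conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?, ?)", EMPLOYEES)
    rows = conn.execute("SELECT Name FROM T WHERE Salary > 100000").fetchall()
    conn.close()
    # Sort in Python so the result order does not depend on the engine.
    return sorted(name for (name,) in rows)


def precision_recall(relevant, retrieved):
    """IR: results are vague, so their quality is measured, not assumed."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def roll_up_quantity(rows, key_index):
    """OLAP: roll up by summing Quantity over one dimension column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]  # index 3 is Quantity
    return dict(totals)


if __name__ == "__main__":
    print(precise_query())
    print(precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"}))
    print(roll_up_quantity(SALES, 0))  # by ProdID
    print(roll_up_quantity(SALES, 1))  # by LocID
```

The SQL query and the roll-up each have one correct answer. The IR result is instead judged by precision and recall against a set of relevant documents, which is the Precise/Vague distinction the comparison draws.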