Introduction to
Knowledge Discovery
and Data Mining
Tu Bao Ho (Hieu Chi Dam)
School of Knowledge Science
Japan Advanced Institute of Science and Technology
19 November 2005
The lecture aims to …
 Provide basic concepts and techniques of knowledge discovery and data mining (KDD).
 Emphasize different kinds of data, different tasks to do with the data, and different methods to do the tasks.
 Emphasize the KDD process and important issues when using data mining methods.
19 November 2005
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association,
clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medical data mining
19 November 2005
Much more data around us than before
We are living in the most exciting of times:
Computers and computer networks
19 November 2005
Astronomical data
Astronomy is facing a major data avalanche:
Multi-terabyte sky surveys and archives (soon:
multi-petabyte), billions of detected sources,
hundreds of measured attributes per source …
19 November 2005
Astronomical data
Multi-wavelength data paint a more complete
(and a more complex!) picture of the universe
Infrared emission from
interstellar dust
Smoothed galaxy
density map
19 November 2005
Earthquake data
[Figures: Japanese earthquakes 1961-1994 and 1932-1996; the 04/25/92 Cape Mendocino, CA earthquake]
19 November 2005
Predict earthquakes

9 August 2004 (AP): Swedish geologists
may have found a way to predict
earthquakes weeks before they happen
(current accurate warnings only come
seconds before a quake).

Water samples taken 4,900 feet beneath the ground in
northern Iceland show the content of several metals
increased dramatically a few weeks before a magnitude
5.8 earthquake struck.

"We need a database over other earthquakes." The
bedrock at the test site is basalt, which is also found in
other earthquake-prone areas like Hawaii and Japan.
19 November 2005
Finance: the market data
Data on price fluctuation throughout the day in the market
19 November 2005
Ishikawa’s monthly industrial data
[Chart: monthly sales at large retail stores (大型小売店売上高), scale 0 to 250,000]
19 November 2005
Explosion of biological data
10,267,507,282
bases in
9,092,760
records.
19 November 2005
What does biological data look like?
A portion (about 350 characters) of a DNA sequence of 1.6 million characters (about 4,570 times longer than the portion shown):
…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAA
ATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGC
TGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAA
ATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAAT
TTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAA
GATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAAT
ATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…
19 November 2005
Text: huge sources of knowledge
 Approximately 80% of the world's data is held in unstructured formats (source: Oracle Corporation)
 Web sites, digital libraries, … increase the volume of textual data
 Example: JAIST's library
  Journals available online: 4,700 (280,000 papers/year)
  Reading just 1% of these (= 2,820 papers) would require 8 papers per day.
Keeping up with the literature: 1960s: easy; 1980s: time-consuming; 2000s: difficult; soon: impossible.
19 November 2005
MEDLINE: a medical text database

The world's most comprehensive source of life sciences
and biomedical bibliographic information, with nearly
eleven million records (http://medline.cos.com)

About 40,000 abstracts on hepatitis (concerning our
research project), below is one of them
36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63.
Pathogenesis of autoimmune hepatitis.
Institute of Liver Studies, King's College Hospital, London, United Kingdom.
Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological
features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis
without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high
titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an
autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet
fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic
predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with
consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral
infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible
individuals; and, (d) the final effector mechanism of tissue damage probably involves auto-antibodies reacting with liver-specific antigens expressed on hepatocyte surfaces, rather than direct T-cell cytotoxicity against hepatocytes.
19 November 2005
Web server access logs data
Typical data in a server access log
looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
...
19 November 2005
Web link data
Internet Map
[lumeta.com]
Friendship Network
[Moody ’01]
Food Web
[Martinez ’91]
19 November 2005
What do we want from the data?
 Much more data of different kinds are collected than before.
 We want to exploit the data, to extract new and useful information/knowledge from the data, such as:
  Which phenomena can be seen from data when a disease occurs?
  What are the properties of several metals 4,900 feet beneath the ground?
  Is the Japanese stock market rising this week?
  How have other researchers talked about the "interferon effect"?
  etc.
 We want to draw valid conclusions from data.
19 November 2005
What does statistics usually do?
Statistics provides principles and methodology for designing the process of:
 Data collection
 Summarizing and interpreting the data
 Drawing conclusions or generalities
19 November 2005
Evolution of data processing
Evolutionary Step: Data Collection (1960s)
 Business Question: "What was my total revenue in the last five years?"
 Enabling Technology: computers, tapes, disks

Evolutionary Step: Data Access (1980s)
 Business Question: "What were unit sales in Korea last March?"
 Enabling Technology: faster and cheaper computers with more storage, relational databases, structured query language (SQL), etc.

Evolutionary Step: Data Warehousing and Decision Support
 Business Question: "What were unit sales in Korea last March? Drill down to Hokkaido."
 Enabling Technology: faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, data warehouses

Evolutionary Step: Data Mining
 Business Question: "What's likely to happen to Hokkaido unit sales next month? Why?"
 Enabling Technology: faster and cheaper computers with more storage, advanced algorithms, massive databases

Drill down: to move from summary information to the detailed data that created it. For example, adding totals from all the orders for a year creates gross sales for the year; drilling down would identify the types of products that were most popular.
19 November 2005
KDD: Convergence of three technologies
Increasing
computing power
Statistical
and learning
algorithms
KDD
Improved data
collection and
management
19 November 2005
Increasing computing power
[Figure: a 1966 storage device, 1.6 meters tall, holding 30 MB]
JAIST's CRAY XT3:
 Compute nodes: AMD Opteron 150 CPUs, 2.4 GHz × 4 × 90
 Memory: 32 GB × 90 = 2.88 TB
 Inter-CPU connection: 3D torus; bandwidth 7.68 GB/s between CPUs (bidirectional)
Lab PC cluster: 16 nodes, dual Intel Xeon 2.4 GHz CPUs / 512 KB cache
19 November 2005
How much information is there?
 Soon everything can be recorded and indexed
 Most bytes will never be seen by humans
 What will be key technologies to deal with huge volumes of information sources?
[Chart: data volumes on a kilo-to-yotta byte scale: a photo, a book, a movie, all books (words; 20 TB contains the 20 M books in LC), all books multimedia, everything recorded]
[This page adapted from the invited talk of Jim Gray (Microsoft) at KDD'2003]
19 November 2005
Relational databases
A relational database is a collection of tables, each of which is assigned a unique name and consists of a set of attributes (data items, such as customer ID) and a set of tuples (data records, i.e. rows).
customer(cust_ID, name, address, age, income, credit_info, …)
  (C1, Smith, Sandy, 5463 E Hasting, Burnaby BC V5A 459, Canada, 21, $27000, 1, …)

item(item_ID, name, brand, category, type, price, place_made, supplier, cost)
  (I3, high-res-TV, Toshiba, high resolution, TV, $988.00, Japan, NikoX, $600.00)
  (I8, multidisc-CD-player, Sanyo, multidisc, CD player, $369.00, Japan, MusicFont, $120.00)

employee(emp_ID, name, category, group, salary, commission)
  (E35, Jones, Jane, home entertainment, manager, $18,000, 2%)

branch(branch_ID, name, address)
  (B1, City square, 369 Cambie St., Vancouver, BC V5L 3A2, Canada)

purchases(trans_ID, cust_ID, empl_ID, date, time, method_paid, amount)
  (T100, C1, E55, 01/21/98, 15:45, Visa, $1357.00)

items_sold(trans_ID, item_ID, qty)
  (T100, I3, 1), (T100, I8, 2)

works_at(empl_ID, branch_ID)
  (E55, B1)
19 November 2005
Data warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

Data sources (in Hokkaido, Kanazawa, Busan, Hong Kong) → Clean, Transform, Integrate, Load → Data warehouse → Query and analysis tools → clients
19 November 2005
Transactional databases
 A transactional database consists of a file where each record represents a transaction.
 A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

Trans_ID  List of item_IDs
T100      beer, cake, onigiri
T200      beer, cake
T300      beer, onigiri
T400      beer, onigiri
T500      cake
19 November 2005
Advanced database systems
 Object-Oriented Databases
 Object-Relational Databases
 Spatial Databases
 Temporal Databases and
Time-Series Databases
 Text Databases and Multimedia
Databases
 Heterogeneous Databases and
Legacy Databases
 The World Wide Web
19 November 2005
Spatial databases
 Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
 Data mining may uncover patterns describing the content of several metals in specific locations when earthquakes happen, the climate of mountainous areas located at various altitudes, etc.
[Figure: Japanese earthquakes 1961-1994]
19 November 2005
Temporal and time-series databases
 They store time-related data. A
temporal database stores relational
data that include time-related
attributes (timestamps with different
semantics). A time-series database
stores sequences of values that
change with time (stock exchange)
 Data mining finds the characteristics
of object evolution, trend of change
for objects: e.g., stock exchange
data can be mined to uncover trends
in investment strategies
 Spatial-temporal databases
19 November 2005
Text and multimedia databases

Text databases contain
documents, usually highly
unstructured or semi-structured.
To uncover general descriptions of
object classes, keywords, content
associations, clustering behavior
of text objects, etc.

Multimedia databases store image, audio, and video data: picture content-based retrieval, voice-email systems, video-on-demand systems, speech-based user interfaces, etc.
19 November 2005
The world wide web
The Web provides an enormous source of
explicit and implicit knowledge that people
can navigate and search for what they need.
Example: when examining the data collected from Internet Mart, heavily trodden paths gave BT hints about regions of the site which were of key interest to its visitors.
19 November 2005
Statistical and learning algorithms


Techniques have often been waiting for computing technology to catch up
 Development and improvement of statistical and learning algorithms during the last decades: support vector machines and kernel methods, multi-relational data mining, graph-based learning, finite state machines, etc.
[Diagrams: graphical models over states S_{t-1}, S_t, S_{t+1} with transitions and observations O_{t-1}, O_t, O_{t+1}:
 HMMs (Hidden Markov Models: directed graph, joint, generative),
 MEMMs (Maximum Entropy Markov Models: directed graph, conditional, discriminative),
 CRFs (Conditional Random Fields: undirected graph, conditional, discriminative)]
19 November 2005
Independent Component Analysis (ICA) vs. Principal Component Analysis (PCA)
 Principal Component Analysis (PCA) finds directions of maximal variance in Gaussian data (second-order statistics).
 Independent Component Analysis (ICA) finds directions of maximal independence in non-Gaussian data (higher-order statistics).
19 November 2005
ICA: separating signals acquired by multiple sensors
[Demo: four microphones (Mic 1 to Mic 4) record mixtures of four speakers (Terry, Te-Won, Scott, Tzyy-Ping); performing ICA separates the played mixtures into their components]
19 November 2005
The need for powerful tools to analyze data
 People gathered and stored so much data because they think some valuable assets are implicitly coded within it. The data's true value depends on the ability to extract useful information.
 How to acquire knowledge for knowledge-based systems remains a difficult and crucial problem in artificial intelligence.
19 November 2005
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association,
clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medical data mining
19 November 2005
Data, information, and knowledge
Metaphor: data is the rock, knowledge is the ore. Who is the miner?
 Knowledge: integrated information consisting of facts and relations ("verified truths"), e.g. E = mc²
 Information: data with meaning, e.g. an object's mass: a measure of an object's resistance to changes in either the speed or direction of its motion
 Data: uninterpreted signals, e.g. 25.1 27.3 21.6 …
19 November 2005
Data, information, and knowledge
[Scatter plot: 29 customers plotted by income vs. debt (US$K); those with income < $33K have defaulted on the loan, the others are in good status with the bank]
(information) Mean of Income = 34.5, Mean of Debt = 18.4
(knowledge) "if income < $33K, then the person has defaulted on the loan"
(data) the (income, debt) pairs:
( 5.6, 8.5), ( 6.0, 13.0), (11.0, 12.0), (11.0, 19.0), (13.5, 10.0), (16.5, 20.0), (17.5, 15.0), (17.5, 5.0), (22.5, 25.0), (26.0, 7.5), (30.0, 9.0), (30.0, 18.0), (30.0, 30.0), (31.0, 14.0), (32.5, 25.0), (38.0, 12.0), (41.0, 9.0), (41.0, 22.0), (43.5, 12.5), (44.0, 27.5), (45.0, 22.5), (48.0, 28.0), (52.5, 21.0), (53.5, 32.0), (54.0, 27.5), (57.5, 18.0), (59.0, 18.0), (62.5, 32.5), (63.0, 18.0)
19 November 2005
KDD: Knowledge Discovery and Data Mining
The automatic extraction of non-obvious, hidden knowledge from large volumes of data.
 10^6-10^12 bytes: large databases that are hard to grasp as a whole or to load into computer memory
 Which data mining algorithm should be applied?
 What kind of knowledge? How should it be represented?
19 November 2005
From data to knowledge
Meningitis data, Tokyo Med. & Dental Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852,
2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400,
71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502,
47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41,
39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS
...
(the attributes are numerical and categorical, with missing values; the last attribute is the class attribute)

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [confidence = 87.5%]
19 November 2005
KDD: An interdisciplinary field
Statistics
Infer information from data
(deduction and induction,
mainly numeric data)
KDD
Databases
Store, access, search,
update data (deduction)
Machine Learning
Computer algorithms that improve
automatically through experience
(mainly induction, symbolic data)
also Algorithmics, Visualization, Data warehouses, OLAP, etc.
19 November 2005
KDD: New and fast growing area
KDD’95, 96, 97, 98, …, 04, 05 (ACM, America)
PAKDD’97, 98, 99, 00, …, 04, 05 (Pacific & Asia)
http://www.jaist.ac.jp/PAKDD-05
PKDD’97, 98, 99, 00, …, 04, 2005 (Europe)
ICDM’01, 02,…, 04, 05 (IEEE), SDM’01, …, 04, 05 (SIAM)
Industrial Interest: IBM, Microsoft, Silicon Graphics,
Sun, Boeing, NASA, SAS, SPSS, …
Japan: the FGCS Project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. Projects "Knowledge Science", "Discovery Science", and "Active Mining Project" (2001-2004)
19 November 2005
The KDD process
KDD is inherently interactive and iterative:
1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of effort and cost in KDD)
3. Data mining: extract patterns/models (the step in the KDD process consisting of methods that produce useful patterns or models from the data)
4. Interpret and evaluate the discovered knowledge
5. Put the results into practical use
19 November 2005
Common tasks in the KDD process
1. Create/select target database; data warehousing (data organized by function); select sampling technique and sample data
2. Supply missing values; eliminate noisy data; normalize values; transform values; create derived attributes; find important attributes & value ranges; transform to a different representation
3. Select DM task(s); select DM method(s)
4. Extract knowledge (query & report generation; aggregation & sequences; advanced methods)
5. Test knowledge; refine knowledge
19 November 2005
Data schemas vs. mining methods
Types of data:
 Flat data tables
 Relational databases
 Temporal & spatial
 Transactional databases
 Multimedia data
 Genome databases
 Materials science data
 Textual data
 Web data
 etc.
Mining tasks and methods:
 Classification/Prediction: decision trees, neural networks, rule induction, support vector machines, hidden Markov models, etc.
 Description: association analysis, clustering, summarization, etc.
Different data schemas call for different mining methods.
19 November 2005
Dataset: cancerous and healthy cells
Unsupervised data: the cells H1-H4 and C1-C4 without class labels.
Supervised data:

      color  #nuclei  #tails  class
H1    light  1        1       healthy
H2    dark   1        1       healthy
H3    light  1        2       healthy
H4    light  2        1       healthy
C1    dark   1        2       cancerous
C2    dark   2        1       cancerous
C3    light  2        2       cancerous
C4    dark   2        2       cancerous
19 November 2005
What to do? Primary tasks of KDD
 Predictive mining tasks perform inference on the current data in order to make predictions or classifications.
  Ex. If "color = dark" and "#nuclei = 2" then cancerous
 Descriptive mining tasks characterize the general properties of the data in the database.
  Ex. "Healthy cells mostly have one nucleus while cancerous ones have two"
19 November 2005
What to find? Patterns and models
Patterns: local summaries of relationships
 A pattern is a low-level summary of a relationship, perhaps one which holds only for a few records or for only a few variables (local).
 Ex. If "color = dark" and "#nuclei = 2" then cancerous
Models: global descriptions of a data set
 A model is a global description of a data set, from a high-level population or large-sample perspective.
 A model tells us about correlation between variables (regression), about hierarchies of clusters (clustering), a neural network, etc.
19 November 2005
Classification/Prediction
Classification/prediction is the process of finding a set of models (or patterns) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown (classification applies to categorical data, prediction to numerical data).
 Decision trees
 IF-THEN rules
 Neural networks
 Support vector machines
 etc.
19 November 2005
Classification—A two-step process
[Diagram:
 Model construction: the training data (H1-H4, C1, C2, …) is fed to a classification algorithm, which produces a classifier (model);
 Model usage: the classifier predicts an unknown case (cancerous?)]
Example model: If color = dark and #tails = 2 Then cancerous cell
19 November 2005
Criteria for classification methods
 Predictive accuracy: the ability of the classifier to correctly predict unseen data
 Speed: refers to computation cost
 Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
 Scalability: the ability to construct the classifier efficiently given large amounts of data
 Interpretability: the level of understanding and insight that is provided by the classifier
19 November 2005
Description
Description is the process of finding a set of patterns or models that describe properties of data (essentially a summary of the data), for the purpose of understanding the data.
 Clustering
 Association mining
 Summarization
 Trend detection
 etc.
19 November 2005
Criteria for description methods
 Interestingness: an overall measure combining the novelty, utility, simplicity, reliability and validity of discovered patterns/models
 Speed: refers to computation cost
 Scalability: the ability to find the patterns/models efficiently given large amounts of data
 Interpretability: the level of understanding and insight that is provided by the discovered patterns/models
19 November 2005
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association,
clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medical data mining
19 November 2005
Decision tree learning
Learn to generate classifiers in the form of decision
trees from supervised data.
19 November 2005
Mining with decision trees
A decision tree is a flow-chart-like tree structure:
 each internal node denotes a test on an attribute
 each branch represents an outcome of the test
 leaf nodes represent classes or class distributions
 the top-most node in a tree is the root node
[Decision tree: the root tests #nuclei on {H1, H2, H3, H4, C1, C2, C3, C4}; branch "1" leads to {H1, H2, H3, C1}, split by color (light → {H1, H3}: H; dark → split by #tails into an H leaf and a C leaf); branch "2" leads to {H4, C2, C3, C4}, split by #tails and color until each leaf holds a single class (H or C)]
19 November 2005
Decision tree induction (DTI)
 Decision tree generation consists of two phases:
  Tree construction: partition examples recursively based on selected attributes; at the start, all the training objects are at the root
  Tree pruning: identify and remove branches that reflect noise or outliers
 Use of a decision tree: classify unknown objects by testing the attribute values of the object against the decision tree
DTI general algorithm
Two steps: recursively generate the tree (1-4), then prune the tree (5):
1. At each node, choose the "best" attribute by a given measure for attribute selection
2. Extend the tree by adding a new branch for each value of the attribute
3. Sort the training examples to the leaf nodes
4. If the examples in a node belong to one class, Then stop, Else repeat steps 1-4 for the leaf nodes
5. Prune the tree to avoid over-fitting
[Decision tree as on the previous slide: the root tests #nuclei, splitting {H1, …, C4} into {H1, H2, H3, C1} and {H4, C2, C3, C4}; the subtrees test color and #tails until the leaves are pure H or C]
19 November 2005
Measures for attribute selection
19 November 2005
Training data for concept “play-tennis”
Day  Outlook   Temperature  Humidity  Wind    Class
D1   sunny     hot          high      weak    N
D2   sunny     hot          high      strong  N
D3   overcast  hot          high      weak    Y
D4   rain      mild         high      weak    Y
D5   rain      cool         normal    weak    Y
D6   rain      cool         normal    strong  N
D7   overcast  cool         normal    strong  Y
D8   sunny     mild         high      weak    N
D9   sunny     cool         normal    weak    Y
D10  rain      mild         normal    weak    Y
D11  sunny     mild         normal    strong  Y
D12  overcast  mild         high      strong  Y
D13  overcast  hot          normal    weak    Y
D14  rain      mild         high      strong  N

 A typical dataset in machine learning
 14 objects belonging to two classes {Y, N} are observed on 4 properties.
 Dom(Outlook) = {sunny, overcast, rain}
 Dom(Temperature) = {hot, mild, cool}
 Dom(Humidity) = {high, normal}
 Dom(Wind) = {weak, strong}
19 November 2005
A decision tree for playing tennis
[Diagram: a large decision tree that tests temperature at the root; its subtrees must still test outlook, humidity, and wind over several levels before all 14 training days ({D1}, …, {D14}) are classified yes/no]
19 November 2005
A simple decision tree for playing tennis

outlook?
 sunny {D1, D2, D8, D9, D11} → humidity?
  high {D1, D2, D8} → no
  normal {D9, D11} → yes
 o'cast {D3, D7, D12, D13} → yes
 rain {D4, D5, D6, D10, D14} → wind?
  strong {D6, D14} → no
  weak {D4, D5, D10} → yes

This tree is much simpler as "outlook" is selected at the root.
How can a good attribute be selected to split a decision node?
19 November 2005
Which attribute is the best?
 The "playing-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
 If attributes "humidity" and "wind" split S into subnodes with the following proportions of positive and negative objects, which attribute is better?

A1 = humidity: [9+, 5-] → normal [6+, 1-], high [3+, 4-]
A2 = wind:     [9+, 5-] → weak [6+, 2-], strong [3+, 3-]
19 November 2005
Entropy
 Entropy characterizes the impurity (purity) of an arbitrary collection of objects.
 S is the collection of positive and negative objects
 p⊕ is the proportion of positive objects in S
 p⊖ is the proportion of negative objects in S
 In the play-tennis example, these numbers are 14, 9/14 and 5/14, respectively
 Entropy is defined as follows:
  Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
19 November 2005
Entropy
[Plot: the entropy function relative to a Boolean classification, as the proportion p⊕ of positive objects varies between 0 and 1]
For c classes: Entropy(S) = −Σ_{i=1}^{c} p_i log2 p_i
19 November 2005
Example
From the 14 examples of Play-Tennis, 9 positive and 5 negative objects (denoted [9+, 5-]):
Entropy([9+, 5-]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p⊕ = 1), then p⊖ is 0, and Entropy(S) = −1·log2(1) − 0·log2(0) = 0.
2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1.
19 November 2005
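As a small illustration (a minimal sketch, not part of the original slides), the entropy computation in Python:

```python
import math

def entropy(labels):
    """Entropy of a collection of class labels, e.g. ['Y', 'Y', 'N', ...]."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n      # proportion of objects in class c
        result -= p * math.log2(p)
    return result

# The play-tennis collection [9+, 5-]:
print(round(entropy(['Y'] * 9 + ['N'] * 5), 3))  # 0.94
```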
Information gain measures the expected reduction in entropy
We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which A has value v.
19 November 2005
Information gain measures the expected reduction in entropy
Values(Wind) = {Weak, Strong}, S = [9+, 5-]
S_weak, the subnode with value "weak", is [6+, 2-]
S_strong, the subnode with value "strong", is [3+, 3-]

Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) · Entropy(S_v)
              = Entropy(S) − (8/14)·Entropy(S_weak) − (6/14)·Entropy(S_strong)
              = 0.940 − (8/14)·0.811 − (6/14)·1.00
              = 0.048
19 November 2005
Which attribute is the best classifier?

Humidity: S: [9+, 5-], E = 0.940
 High → [3+, 4-], E = 0.985; Normal → [6+, 1-], E = 0.592
 Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind: S: [9+, 5-], E = 0.940
 Weak → [6+, 2-], E = 0.811; Strong → [3+, 3-], E = 1.00
 Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
19 November 2005
Information gain of all attributes
Gain (S, Outlook) = 0.246
Gain (S, Humidity) = 0.151
Gain (S, Wind) = 0.048
Gain (S, Temperature) = 0.029
19 November 2005
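An illustrative Python sketch (not from the original slides) that recomputes these gains from the play-tennis table:

```python
import math

# The 14 play-tennis examples: (outlook, temperature, humidity, wind, class)
DATA = [
    ('sunny', 'hot', 'high', 'weak', 'N'), ('sunny', 'hot', 'high', 'strong', 'N'),
    ('overcast', 'hot', 'high', 'weak', 'Y'), ('rain', 'mild', 'high', 'weak', 'Y'),
    ('rain', 'cool', 'normal', 'weak', 'Y'), ('rain', 'cool', 'normal', 'strong', 'N'),
    ('overcast', 'cool', 'normal', 'strong', 'Y'), ('sunny', 'mild', 'high', 'weak', 'N'),
    ('sunny', 'cool', 'normal', 'weak', 'Y'), ('rain', 'mild', 'normal', 'weak', 'Y'),
    ('sunny', 'mild', 'normal', 'strong', 'Y'), ('overcast', 'mild', 'high', 'strong', 'Y'),
    ('overcast', 'hot', 'normal', 'weak', 'Y'), ('rain', 'mild', 'high', 'strong', 'N'),
]
ATTRS = {'Outlook': 0, 'Temperature': 1, 'Humidity': 2, 'Wind': 3}

def entropy(rows):
    n = len(rows)
    result = 0.0
    for cl in set(r[-1] for r in rows):
        p = sum(1 for r in rows if r[-1] == cl) / n
        result -= p * math.log2(p)
    return result

def gain(rows, attr):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) Entropy(Sv)."""
    i = ATTRS[attr]
    g = entropy(rows)
    for v in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# Outlook 0.246, Temperature 0.029, Humidity 0.151, Wind 0.048
```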
Next step in growing the decision tree
{D1, D2, ..., D14} [9+, 5-]
Outlook:
 Sunny → {D1, D2, D8, D9, D11} [2+, 3-] → ?
 Overcast → {D3, D7, D12, D13} [4+, 0-] → Yes
 Rain → {D4, D5, D6, D10, D14} [3+, 2-] → ?
Which attribute should be tested here?
S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(S_sunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
19 November 2005
Stopping condition
1. Every attribute has already been included along
this path through the tree
2. The training objects associated with each leaf
node all have the same target attribute value
(i.e., their entropy is zero)
Notice: Algorithm ID3 uses Information Gain; C4.5, its successor, uses Gain Ratio (a variant of Information Gain)
19 November 2005
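Putting the recursion, the gain measure, and these stopping conditions together, an ID3-style sketch in Python (a minimal illustration, not the original ID3 implementation):

```python
import math

def entropy(rows):
    n = len(rows)
    result = 0.0
    for cl in set(r[-1] for r in rows):
        p = sum(1 for r in rows if r[-1] == cl) / n
        result -= p * math.log2(p)
    return result

def id3(rows, attrs):
    """Grow a tree as nested dicts keyed by (attribute, value); leaves are labels."""
    classes = set(r[-1] for r in rows)
    if len(classes) == 1:              # stopping condition 2: node is pure
        return classes.pop()
    if not attrs:                      # stopping condition 1: attributes exhausted
        return max(classes, key=lambda c: sum(r[-1] == c for r in rows))
    def gain(i):                       # information gain of attribute index i
        g = entropy(rows)
        for v in set(r[i] for r in rows):
            sub = [r for r in rows if r[i] == v]
            g -= len(sub) / len(rows) * entropy(sub)
        return g
    best = max(attrs, key=gain)        # step 1: choose the "best" attribute
    tree = {}
    for v in set(r[best] for r in rows):   # steps 2-4: branch and recurse
        sub = [r for r in rows if r[best] == v]
        tree[(best, v)] = id3(sub, [a for a in attrs if a != best])
    return tree

# rows are tuples with the class label last; attrs are column indices
# e.g. id3(DATA, [0, 1, 2, 3]) on the play-tennis table puts outlook at the root
```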
Over-fitting in decision trees
 The generated tree may overfit the training data
  Too many branches, some of which may reflect anomalies due to noise or outliers
  Results in poor accuracy for unseen objects
 Two approaches to avoid overfitting
  Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
   Difficult to choose an appropriate threshold
  Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
   Use a set of data different from the training data to decide which is the "best pruned tree"
19 November 2005
Converting a tree to rules
[Decision tree: outlook? sunny → humidity? (high → no; normal → yes); o'cast → yes; rain → wind? (strong → no; weak → yes)]

IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal) THEN PlayTennis = Yes
19 November 2005
Attributes with many values
 If an attribute has many values (e.g., days of the month), ID3 will select it
 C4.5 uses GainRatio instead:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = −Σ_{i=1}^{c} (|S_i| / |S|) log2 (|S_i| / |S|)

where S_i is the subset of S for which A has value v_i.
19 November 2005
Bayesian classification
Learning statistical classifiers from supervised data, based on Bayes' theorem and assumptions about the independence/dependence of the data.
19 November 2005
What is Bayesian classification?
 Bayesian classifiers are statistical classifiers. Bayesian classification is based on Bayes' theorem.
 Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.
 Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.
19 November 2005
Bayes theorem

Let X be an object whose class label is
unknown

Let H be some hypothesis, such as that X
belongs to class C.

For classification, we want to determine the
posterior probability P(H|X), of H conditioned
on X.

Example: data objects consist of fruits, described by their color and shape

Suppose X is red and round, and H is the
hypothesis that X is an apple.

P(H|X) reflects our confidence that X is an
apple given that we have seen that X is red
and round.
P(apple | red and round) = ?
19 November 2005
Bayes theorem

In contrast, P(H) is the prior probability of H.

In our example, P(H) is the probability that
any given data object is an apple, regardless
of how the data sample looks (independent of
X).

P(X|H) is the likelihood of X given H, that is
the probability that X is red and round given
that we know that it is true that X is an apple.

P(X), P(H), and P(X|H) may be estimated from
the given data. Bayes theorem allows us to
calculate P(H|X)
P(H|X) = P(X|H) · P(H) / P(X)

P(apple | red and round) = P(red and round | apple) · P(apple) / P(red and round)
19 November 2005
Naïve Bayesian classification

Suppose X = (x1, x2, …, xn), attributes A1, A2, …, An

There are m classes C1, C2, …, Cm

P(Ci|X) denotes probability that X is classified to class Ci.

Example:
P(class = N | outlook=sunny, temperature=hot, humidity=high, wind=strong)

Idea: assign to object X the class label Ci such that P(Ci|X) is maximal, i.e., P(Ci|X) > P(Cj|X) for all j ≠ i.

Ci is called the maximum posterior hypothesis.
19 November 2005
Estimating a posteriori probabilities
 Bayes' theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
 P(X) is constant; we only need to maximize P(X|Ci) P(Ci)
 The Ci such that P(Ci|X) is maximum is the Ci such that P(X|Ci)·P(Ci) is maximum
 If the prior probabilities are unknown, it is commonly assumed that P(C1) = P(C2) = … = P(Cm), and we maximize P(X|Ci)
 Otherwise, P(Ci) = relative frequency of class Ci = Si/S
 Problem: computing P(X|Ci) directly is infeasible!
19 November 2005
Naïve Bayesian classification
 Naïve assumption: we have P(X|Ci) = P(x1, …, xn|Ci); if the attributes are independent then P(X|Ci) = P(x1|Ci) × … × P(xn|Ci)
 If Ak is categorical then P(xk|Ci) = Sik/Si, where Sik is the number of training objects of class Ci having the value xk for Ak, and Si is the number of training objects belonging to Ci.
 If Ak is continuous then the attribute is typically assumed to have a Gaussian distribution, so that
  P(xk|Ci) = (1 / (√(2π) · σ_Ci)) · exp(−(xk − μ_Ci)² / (2σ_Ci²))
 To classify an unknown object X, P(X|Ci)P(Ci) is evaluated for each class Ci. X is then assigned to the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj), for 1 ≤ j ≤ m, j ≠ i.
19 November 2005
Play-tennis example: estimating P(xk|Ci)
[Training data: the 14-day play-tennis table shown earlier]

P(Y) = 9/14, P(N) = 5/14

outlook:      P(sunny|Y) = 2/9     P(sunny|N) = 3/5
              P(overcast|Y) = 4/9  P(overcast|N) = 0
              P(rain|Y) = 3/9      P(rain|N) = 2/5
temperature:  P(hot|Y) = 2/9       P(hot|N) = 2/5
              P(mild|Y) = 4/9      P(mild|N) = 2/5
              P(cool|Y) = 3/9      P(cool|N) = 1/5
humidity:     P(high|Y) = 3/9      P(high|N) = 4/5
              P(normal|Y) = 6/9    P(normal|N) = 2/5
wind:         P(strong|Y) = 3/9    P(strong|N) = 3/5
              P(weak|Y) = 6/9      P(weak|N) = 2/5
19 November 2005
Play-tennis example: classifying X
 An unseen object X = <rain, hot, high, weak>
 P(X|Y)·P(Y) = P(rain|Y)·P(hot|Y)·P(high|Y)·P(weak|Y)·P(Y) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
 P(X|N)·P(N) = P(rain|N)·P(hot|N)·P(high|N)·P(weak|N)·P(N) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
 Object X is classified in class N (don't play)
19 November 2005
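A compact Python sketch (illustrative, not from the slides) of this naïve Bayesian computation on the play-tennis data:

```python
from collections import Counter

# The 14 play-tennis examples: (outlook, temperature, humidity, wind, class)
DATA = [
    ('sunny', 'hot', 'high', 'weak', 'N'), ('sunny', 'hot', 'high', 'strong', 'N'),
    ('overcast', 'hot', 'high', 'weak', 'Y'), ('rain', 'mild', 'high', 'weak', 'Y'),
    ('rain', 'cool', 'normal', 'weak', 'Y'), ('rain', 'cool', 'normal', 'strong', 'N'),
    ('overcast', 'cool', 'normal', 'strong', 'Y'), ('sunny', 'mild', 'high', 'weak', 'N'),
    ('sunny', 'cool', 'normal', 'weak', 'Y'), ('rain', 'mild', 'normal', 'weak', 'Y'),
    ('sunny', 'mild', 'normal', 'strong', 'Y'), ('overcast', 'mild', 'high', 'strong', 'Y'),
    ('overcast', 'hot', 'normal', 'weak', 'Y'), ('rain', 'mild', 'high', 'strong', 'N'),
]

def naive_bayes(x):
    """Return the class Ci maximizing P(X|Ci) P(Ci) under the naive assumption."""
    n = len(DATA)
    best = None
    for ci, count in Counter(r[-1] for r in DATA).items():
        rows = [r for r in DATA if r[-1] == ci]
        score = count / n                          # prior P(Ci)
        for i, v in enumerate(x):                  # P(xk|Ci) = Sik / Si
            score *= sum(1 for r in rows if r[i] == v) / count
        if best is None or score > best[1]:
            best = (ci, score)
    return best

print(naive_bayes(('rain', 'hot', 'high', 'weak')))  # ('N', 0.018285...)
```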
The independence hypothesis
 It makes computation possible
 It yields optimal classifiers when independence is satisfied
 But it is seldom satisfied in practice, as attributes (variables) are often correlated
 Attempts to overcome this limitation include, among others:
  Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  Decision trees, which reason on one attribute at a time, considering the most important attributes first
19 November 2005
Other classification methods

Neural Networks

Instance-based Classification

Genetic Algorithms

Rough Set Approach

Statistical Approaches

Support Vector Machines

etc.
19 November 2005
Mining with neural networks
[Diagram: the cell data (H1-H4, C1-C4) is fed to a neural network with inputs such as "color = dark", "#nuclei = 1", "#tails = 2" and outputs "Healthy" / "Cancerous"]
Mining with neural networks

Advantages
  prediction accuracy is generally high
  robust, works when training examples contain errors
  output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  fast evaluation of the learned target function
 Criticism
  long training time
  difficult to understand the learned function (weights)
  not easy to incorporate domain knowledge
19 November 2005
Instance-based classification
 Uses the most similar individual instances known from the past to classify a new instance
 Typical approaches:
  k-nearest neighbor: instances as points in a Euclidean space (see the sketch below)
  Locally weighted regression: constructs a local approximation
  Case-based reasoning: uses symbolic representations and knowledge-based inference
19 November 2005
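For the k-nearest neighbor approach, a minimal Python sketch (illustrative; the toy points and labels below are made up):

```python
from collections import Counter

def knn_classify(x, examples, k=3):
    """Classify x by majority vote among the k nearest labeled points."""
    # examples: list of (point, label), points as tuples of numbers
    nearest = sorted(examples,
                     key=lambda e: sum((a - b) ** 2 for a, b in zip(x, e[0])))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((1.0, 1.0), 'healthy'), ((1.2, 0.9), 'healthy'),
        ((3.0, 3.2), 'cancerous'), ((3.1, 2.9), 'cancerous')]
print(knn_classify((2.9, 3.0), data))  # 'cancerous'
```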
Genetic algorithms (GA)
Analogy between evolution and problem solving: environment ↔ problem; individual ↔ candidate solution; fitness ↔ quality. Fitness determines the chances for survival and reproduction; quality determines the chance of seeding new solutions.

Generate initial population
do
  Calculate the fitness of each member
  // simulate another generation
  do
    1. Select parents from the current population
    2. Perform crossover to add offspring to the new population
  while the new population is not full
  1. Merge the new population into the current population
  2. Mutate the current population
while not converged
19 November 2005
Rough set approach
 Rough sets are used to approximately or "roughly" define equivalence classes
 A rough set for a class C is approximated by two sets:
  A lower approximation (certain to be in C)
  An upper approximation (possible to be in C)
 Finding the minimal subsets (reducts) of attributes, dependencies in data, rules, etc.
[Diagram: a universe U of objects is partitioned into equivalence classes by the attributes R = {color, shape}; a set X is approximated from below by X_* (the union of classes fully inside X) and from above by X^* (the union of classes intersecting X)]
References:
 Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Pub., 1997
 Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998
19 November 2005
Association rule mining
Description learning that aims to find all possible associations in the data.
19 November 2005
Market basket analysis

Analyzes customer buying habits by finding associations
between the different items that customers place in their
“shopping baskets”

Helps develop marketing strategies by gaining insight into
which items are frequently purchased together by customers
How often do people buy onigiri and beer together?
19 November 2005
Rule measures: support and confidence
[Venn diagram: customers who buy beer, customers who buy onigiri, and customers who buy both]

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

 Association rule X ⇒ Y
  support s = probability that a transaction contains X and Y
  confidence c = conditional probability that a transaction having X also contains Y
 If minimum support 50%, minimum confidence 50%:
  A ⇒ C (s = 50%, c = 66.6%)
  C ⇒ A (s = 50%, c = 100%)
19 November 2005
Basic concepts
 The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
 The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
 Confidence denotes the strength of the implication and support indicates the frequency of the occurring patterns in the rule:

support(X ⇒ Y) = P(X and Y) = (# trans. containing both X and Y) / (# all trans. in the database)
confidence(X ⇒ Y) = P(Y|X) = P(X and Y) / P(X) = (# trans. containing both X and Y) / (# trans. containing X)
19 November 2005
Association mining: Apriori algorithm
It is composed of two steps:
1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence
19 November 2005
Association mining: Apriori principle

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Min. support 50%, min. confidence 50%:

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For the rule A ⇒ C:
 support = support({A and C}) = 50%
 confidence = support({A and C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent (if an itemset is not frequent, its supersets are not).
19 November 2005
The Apriori algorithm
1. Find the frequent itemsets: the sets of items that have support higher than the minimum support
  A subset of a frequent itemset must also be a frequent itemset; i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  Iteratively find the frequent itemsets Lk with cardinality from 1 to k (k-itemsets) from the candidate itemsets Ck (Lk ⊆ Ck):
   C1 → L1 → … → Ci → Li → Ci+1 → … → Lk
2. Use the frequent itemsets to generate association rules.
19 November 2005
Apriori algorithm: Finding frequent itemsets
using candidate generation

Mining frequent itemsets for Boolean association rules

It employs an iterative approach to level-wise search
where k-itemsets are used to explore (k+1)-itemsets


First, the set of frequent 1-itemsets L1 is found.

L1 is used to find the set of 2-itemsets L2,

then L2 is used to find L3,

and so on, until no more frequent k-itemsets can be found.
To improve the efficiency of the level-wise generation of
frequent itemsets, the important Apriori property is
used to reduce the search space.
19 November 2005
Apriori algorithm: Finding frequent itemsets
using candidate generation

If an itemset l does not satisfy the minimum support
threshold min_sup, then l is not frequent, that is,
P(l)<min_sup

If an item A is added to the itemset l, then the resulting
itemset (i.e., l  A) cannot occur more frequently than l.
Therefore, l  A is not frequent either, that is,
P(l  A) < min_sup

This is the anti-monotone property:
if a set cannot pass a test, all of its
supersets will fail the same test as well
19 November 2005
Apriori algorithm: join step to find Ck
 Join Lk-1 with itself to generate a set Ck of candidate k-itemsets, and use it to find Lk
 Given l1, l2 ∈ Lk-1, the notation li[j] refers to the jth item in li
 Apriori assumes that items within a transaction or itemset are sorted in lexicographic order
 The join, Lk-1 ⋈ Lk-1, is performed where members of Lk-1 are joinable if their first (k-2) items are in common. That is, l1, l2 ∈ Lk-1 are joined if
  (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
 The itemset resulting from joining l1 and l2 is l1[1] l1[2] … l1[k-2] l1[k-1] l2[k-1]
19 November 2005
Apriori algorithm: The prune step

Ck is a superset of Lk, i.e. its members may or may not be
frequent, but all frequent k-itemsets are included in Ck (Lk  Ck)

A scan of the database to determine the count of each candidate
in Ck would result in the determination of Lk (all candidates
having a count no less than the min_sup_count are frequent and
therefore belong to Lk)

Ck can be huge. To reduce the size of Ck: use apriori property

Any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset

Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1,
then the candidate cannot be frequent either and so can be
removed from Ck (can be tested quickly by a hash tree of frequent
itemsets)
19 November 2005
Example (min_sup_count = 2)
Transactional data D:

TID   List of item_IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Scan D for the count of each candidate in C1, then compare the candidate support counts with the minimum support count to obtain L1:

C1 = L1:
Itemset  Sup. count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2
19 November 2005
Example (min_sup_count = 2), continued
Generate the candidates C2 from L1 using the Apriori principle; scan D for the count of each candidate; compare the candidate support counts with the minimum support count to obtain L2:

C2 (with support counts):
{I1, I2} 4, {I1, I3} 4, {I1, I4} 1, {I1, I5} 2, {I2, I3} 4,
{I2, I4} 2, {I2, I5} 2, {I3, I4} 0, {I3, I5} 1, {I4, I5} 0

L2:
{I1, I2} 4, {I1, I3} 4, {I1, I5} 2, {I2, I3} 4, {I2, I4} 2, {I2, I5} 2

Generate the candidates C3 from L2 using the Apriori principle; scan D for the count of each candidate; compare with the minimum support count to obtain L3:

C3 = L3:
{I1, I2, I3} 2
{I1, I2, I5} 2
19 November 2005
Example (min_sup_count = 2)
1. Join: C3 = L2 ⋈ L2
 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
2. Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent. Does any candidate have a subset that is not frequent?
 The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, {I2, I3}, which are all members of L2. Therefore, keep {I1, I2, I3} in C3.
 The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, {I2, I5}, which are all members of L2. Therefore, keep {I1, I2, I5} in C3.
 The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, {I3, I5}. The subset {I3, I5} is not a member of L2. Therefore, remove {I1, I3, I5} from C3, and so on.
3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}}
19 November 2005
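The level-wise search of this example can be sketched in a few lines of Python (an illustrative sketch, not a full Apriori implementation):

```python
from itertools import combinations

# The transactional data D from the example
D = [
    {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
    {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
    {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
]
MIN_SUP = 2

def support_count(itemset):
    return sum(1 for t in D if itemset <= t)

# L1: frequent 1-itemsets
Lk = [frozenset([i]) for i in sorted(set().union(*D))
      if support_count(frozenset([i])) >= MIN_SUP]
while Lk:
    print([(sorted(s), support_count(s)) for s in Lk])  # L1, L2, L3, ...
    k = len(next(iter(Lk))) + 1
    # Join step: combine pairs of frequent (k-1)-itemsets into k-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step (Apriori property): every (k-1)-subset must be frequent
    candidates = [c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))]
    Lk = [c for c in candidates if support_count(c) >= MIN_SUP]
```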
Cluster analysis
Description learning that aims to detect groups of similar objects in unsupervised data.
19 November 2005
What is cluster analysis?
 A cluster is a collection of data objects satisfying:
  Objects in this cluster are similar to one another
  Objects in this cluster are dissimilar to the objects in other clusters
 The process of grouping objects into clusters is called clustering
19 November 2005
Mining with clustering
 Clustering analyzes data objects without consulting a known class label.
 The objects are clustered or grouped based on the principle of maximizing the within-class similarity and minimizing the between-class similarity
 Partition-based clustering for large sets of numerical data.
 Hierarchical clustering, with at least O(n²) time complexity, seems unsuitable for very large datasets
19 November 2005
What is good clustering?

A good clustering method will produce high quality
clusters with
 high
intra-class similarity (within a class)
 low inter-class similarity (between classes)

The quality of clustering basically depends on the
similarity measure and the cluster representative
used by the method

New forms of clustering require different criteria
of quality.
19 November 2005
Clustering in different fields
 Statistics: for many years, a focus on distance-based clustering (S-Plus, SPSS, SAS)
 Machine learning: unsupervised learning. In conceptual clustering, a group of objects forms a class only if it is described by a concept
 KDD: efficient and effective clustering of large databases: scalability, complex shapes and types of data, high dimensional clustering, mixed numerical and categorical data
19 November 2005
Typical requirements of clustering

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to
determine input parameters

Ability to deal with noisy data

Insensitivity to the order of input records

High dimensionality

Constraint-based clustering

Interpretability and usability
19 November 2005
Clustering methods in KDD

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Methods
19 November 2005
Partitioning methods
 Given n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into a partition of k clusters
 The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are "similar", whereas the objects of different clusters are "dissimilar"
19 November 2005
K-means algorithm (K = 2)
[Diagram, as steps:]
0. Two centers are selected randomly from the n objects
1. Form two clusters by assigning each object to its nearest center
2. Calculate two new centers
3. Re-form two new clusters
4. Repeat steps 2 and 3 until the stopping conditions hold
19 November 2005
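A minimal Python sketch (illustrative) of this loop on 2-D points:

```python
import random

def kmeans(points, k=2, iters=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = random.sample(points, k)        # step 0: centers from the n objects
    clusters = []
    for _ in range(iters):
        # steps 1/3: form clusters by assigning each object to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # step 2: calculate new centers as cluster means
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:            # stopping condition
            break
        centers = new_centers
    return centers, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
print(kmeans(pts)[0])  # two centers, one near each group of points
```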
Partitioning methods

The k-means algorithm is sensitive to outliers

The k-medoids method uses medoid (the most
centrally located object in a cluster)

The EM (Expectation Maximization) algorithm: assigns each object to a cluster according to a weight representing its probability of membership.

PAM (Partitioning Around Medoids)

From k-Medoids to CLARA (Clustering LARge
Applications)

From CLARA to CLARANS (Clustering LARge
Applications based on RANdomized Search)
19 November 2005
Hierarchical methods
Partition Q is nested into partition P if every component of Q is a subset of a component of P.

P = {{x1, x4, x7, x9, x10}, {x2, x8}, {x3, x5, x6}}
Q = {{x1, x4, x9}, {x7, x10}, {x2, x8}, {x5}, {x3, x6}}

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.
19 November 2005
Hierarchical clustering: chameleon
 Chameleon: A Hierarchical Clustering
Algorithm Using Dynamic Modeling.
 Clusters are merged if the
interconnectivity and closeness
between them are highly related to the
internal interconnectivity and
closeness of objects within the
clusters.
19 November 2005
Density-based methods
 Typically regard clusters as dense regions of objects in the data space that are separated by regions of low density
 DBSCAN: based on connected regions with sufficiently high density (nearest neighbor estimation)
 DENCLUE: based on density distribution functions (kernel estimation)
[Figure: DBSCAN result for DS2 with MinPts at 4 and Eps at (a) 5.0, (b) 3.5 and (c) 3.0]
19 November 2005
Text mining
Finding unknown useful information in huge collections of textual data.
19 November 2005
What is text mining?
 "The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data."
 "An exploration and analysis of textual (natural language) data by automatic and semi-automatic means to discover new knowledge."
A branch of data mining that targets discovering and extracting knowledge from text documents.
19 November 2005
Text mining: a research example
Extracting scientific evidence from biomedical literature titles (Swanson & Smalheiser, 1997):
 "stress is associated with migraines"
 "stress can lead to loss of magnesium"
 "calcium channel blockers prevent some migraines"
 "magnesium is a natural calcium channel blocker"
Combining the extracted sentence fragments with human medical expertise yields a new hypothesis not found in the literature:
 Magnesium deficiency may play a role in some kinds of migraine headache
19 November 2005
Motivation for text mining
 Approximately 80% of the world's data is held in unstructured formats (source: Oracle Corporation)
 Information intensive business processes demand that we transcend from simple document retrieval to "knowledge" discovery.
[Chart: 20% structured numerical or coded information; 80% unstructured or semi-structured information]
19 November 2005
Disciplines that influence text mining

Computational linguistics (NLP)

Information extraction

Information retrieval

Web mining

Regular data mining
Text Mining = Data Mining (applied to text data)
+ Language Engineering
19 November 2005
Computational linguistics
 Goal: automated language understanding
  this isn't possible today
  instead, go for subgoals of text analysis, e.g., word sense disambiguation, phrase recognition, semantic associations
 Common current approach (trend): statistical analyses over very large text collections

Consider a word like "string" or "rope." No computer today has any way to understand what those things mean. For example, you can pull something with a string, but you cannot push anything. You can tie a package with string, or fly a kite, but you cannot eat a string or make it into a balloon. In a few minutes, any young child could tell you a hundred ways to use a string (or not to use a string), but no computer knows any of this.
19 November 2005
Computational linguistics
[Pipeline diagram: lexical/morphological analysis → syntactic analysis → semantic analysis → discourse analysis, covering tagging, shallow parsing, chunking, grammatical relation finding, named entity recognition, word sense disambiguation, and reference resolution]
Example text: "The woman will give Mary a book"
 POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
 Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
 Grammatical relation finding: [The woman] (subject) [will give] [Mary] (i-object) [a book] (object)
19 November 2005
Archeology of computational linguistics
 1990s-2000s: statistical learning; algorithms, evaluation, corpora; trainable parsers
 1980s: standard resources and tasks; Penn Treebank, WordNet, MUC; trainable FSMs
 1970s: kernel (vector) spaces; clustering, information retrieval (IR)
 1960s: representation transformation; finite state machines (FSM) and augmented transition networks (ATNs)
 1960s: representation beyond the word level; lexical features, tree structures, networks
19 November 2005
Information retrieval
 Given: a source of textual documents and a user query (text based), e.g. "migraines causes"
 Find: a set (ranked) of documents that are relevant to the query
[Diagram: the documents and the query enter the IR system, which returns ranked documents]
Evaluation measures

precision = |retrieved ∩ relevant| / |retrieved|
recall = |retrieved ∩ relevant| / |relevant|
F = ((β² + 1) · P · R) / (β² · P + R),  where P = precision, R = recall
19 November 2005
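A small Python sketch (illustrative; the document IDs are made up) of these measures:

```python
def evaluation(retrieved, relevant, beta=1.0):
    """Precision, recall, and F-measure for a retrieved set vs. a relevant set."""
    hits = len(retrieved & relevant)
    p = hits / len(retrieved)
    r = hits / len(relevant)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

# Toy example: 3 of the 5 retrieved documents are among the 4 relevant ones
print(evaluation({'d1', 'd2', 'd3', 'd4', 'd5'}, {'d2', 'd3', 'd5', 'd9'}))
# (0.6, 0.75, 0.666...)
```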
Information extraction vs. information retrieval: finding "things" but not "pages"
Information extraction is the process of extracting text segments of semi-structured or free text to fill data slots in a predefined template, e.g.:
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2004
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
19 November 2005
What is information extraction?
 Given:

A source of textual documents

A well defined limited query
(text based)
 Find:

Sentences with relevant
information (i.e., identify specific
semantics elements such as
entities, properties, relations)

Extract the relevant information
and ignore non-relevant information
(important!)

Link related information and output
in a predetermined format
19 November 2005
Example: template extraction
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime …. Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador … According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
Incident: Date                 19 Apr 89
Incident: Location             El Salvador: San Salvador
Incident: Type                 Bombing
Perpetrator: Individual ID     "urban guerillas"
Perpetrator: Organization ID   "FMLN"
Human target: Name             "Roberto Garcia Alvarado"
19 November 2005
Zipf’s “law” and its consequence


One of curiosities in text
analysis and mining (1935)

The product of the frequency
of words (f) and their rank (r)
is approximately constant
f = C * 1/r
C  N/10
Rank = order of words’
frequency of occurrence


Always a few very frequent tokens
that are not good discriminators.

Called “stop words” in Information
Retrieval

Usually correspond to linguistic
notion of “closed-class” words

English examples: to, from, on, and,
the, ...

Grammatical classes that don’t take
on new members.
Typically

A few very common words

A middling number of medium
frequency words

A large number of very infrequent
words
Medium frequency words most
descriptive
19 November 2005
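The rank-frequency relation is easy to check empirically; the minimal Python sketch below tabulates f*r for the top-ranked words of a corpus. The file name is an assumption; any sizable English text will do.

```python
from collections import Counter

# Minimal check of Zipf's "law": frequency * rank ≈ constant for mid ranks.
# "corpus.txt" is an assumed plain-text file; any sizable English text works.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()

for rank, (word, freq) in enumerate(Counter(tokens).most_common(10), start=1):
    print(f"{rank:>3}  {word:<12} f={freq:<7} f*r={freq * rank}")
```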
Text mining process

1. Text preprocessing: syntactic/semantic analysis of the text
2. Feature generation: bag of words
3. Feature selection: simple counting and statistics over features such as
   term frequency, document frequency, term proximity, document length, etc.
4. Text/data mining: classification, clustering, summarization, etc.
5. Analyzing results
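A minimal sketch of the bag-of-words step on two toy documents, computing term frequency and document frequency; the names and data are illustrative.

```python
from collections import Counter

# Sketch of the "bag of words" feature-generation step: each document becomes
# a vector of term frequencies; document frequency supports feature selection.
docs = ["data mining turns data into knowledge",
        "text mining applies data mining to text"]

vocab = sorted({w for d in docs for w in d.split()})
tf = [Counter(d.split()) for d in docs]                 # term frequency
df = {w: sum(w in c for c in tf) for w in vocab}        # document frequency
vectors = [[c[w] for w in vocab] for c in tf]           # one row per document

print(vocab)
print(vectors)   # shared terms ("data", "mining") are non-zero in both rows
print(df)
```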
Typical issues and techniques

- Text categorization (text classification)
- Text clustering
- Text summarization
- Trend detection
- Relationship analysis
- Information extraction
- Question answering
- Text visualization
- etc.
Text categorization (classification)

- Task: assignment of one or more labels from a pre-defined set to a document
- Example category sets:
  - MeSH medical hierarchy
  - JAIST's library, Library of Congress subject headings
- Idea: content vs. external meta-data
- Techniques: supervised classification
  - decision trees
  - naive Bayesian classification
  - Support Vector Machines
  - etc.
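A toy sketch of supervised text categorization with naive Bayes, here using scikit-learn (an assumed library, not named in the lecture) and invented two-class training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy supervised text categorization with naive Bayes (scikit-learn assumed).
train_docs = ["the patient shows liver inflammation",
              "interferon therapy and fibrosis stage",
              "stock prices fell sharply today",
              "the market rallied on earnings news"]
train_labels = ["medicine", "medicine", "finance", "finance"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["liver biopsy confirmed cirrhosis"]))   # -> ['medicine']
```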
Text clustering

- Task: detecting topics within a document collection, assigning documents to
  those topics, and labeling these topic clusters
- Scatter/Gather clustering:
  - cluster sets of documents into general "themes", like a table of contents
  - display the contents of the clusters by showing typical terms and typical
    titles
  - the user chooses subsets of the clusters and re-clusters the documents
    within
  - the resulting new groups have different "themes"
- Techniques: different clustering techniques (similarity between texts, texts
  and word densities, etc.)
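A toy sketch of one such technique: k-means over TF-IDF vectors, with each cluster labeled by its highest-weight terms in the spirit of Scatter/Gather. Scikit-learn and the toy documents are assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy text clustering: TF-IDF vectors + k-means (one of many possible choices).
docs = ["hepatitis virus liver inflammation",
        "liver fibrosis biopsy stage",
        "web page hyperlink structure",
        "search engine web query ranking"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)          # liver documents vs. web documents

# Label each cluster by its highest-weight terms, as Scatter/Gather displays do.
terms = vec.get_feature_names_out()
for k, center in enumerate(km.cluster_centers_):
    print(k, [terms[i] for i in np.argsort(center)[::-1][:3]])
```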
Text summarization

- A text is entered into the computer and a summarized text is returned, which
  is a non-redundant extract from the original text.
- A process of text summarization:
  - Sentence extraction: find a set of important sentences that covers the
    gist of the text document
  - Sentence reduction: convert a long sentence to a short one without losing
    the meaning
  - Sentence combination: combine sentences to make a text
Emerging trend detection

- Task: detecting topic areas that are growing in interest and utility over
  time (emerging trends)
- Example: "Find sales trends by product and correlate with occurrences of
  company name in business news articles"
- Example: INSPEC® [INS] database search on the keyword "XML":

    Year   Number of documents
    1994     3
    1995     1
    1996     8
    1997    10
    1998   170
    1999   371

- KDD'03 challenge: from arXiv (since 1991), 500,000 articles on high energy
  particle physics; predict, say, the number of citations of a given article
  in a given period, etc.
- COE project: can we detect emerging trends in materials science, information
  science, and biology?
Question answering

- Task: give an answer to a question
  (cf. document retrieval: find documents relevant to a query)
- Examples (Buchholz & Daelemans, 2001):
  - Who invented the telephone?  -> Alexander Graham Bell
  - When was the telephone invented?  -> 1876
- Imagine how to automatically answer such a question?
Text visualization

- Network maps (http://www.lexiquest.com)
- Landscapes (http://www.aurigin.com)
Web mining

Finding unknown useful information from the World Wide Web.
Data mining and Web mining

- Data mining turns data into knowledge.
- Web mining applies data mining techniques to extract and uncover knowledge
  from Web documents and services.
WWW specifics

- The Web is a huge, widely-distributed, highly heterogeneous, semi-structured,
  hypertext/hypermedia, interconnected information repository.
- The Web is a huge collection of documents plus:
  - hyper-link information
  - access and usage information
Web user tasks

- Finding relevant information
  - The user usually issues a simple keyword query and receives a list of
    ranked pages.
  - Current problems: low precision (irrelevance of search results) and low
    recall (inability to index all the information on the Web)
- Creating new knowledge over the existing data
  - We want to extract knowledge from Web data (assuming we have it).
- Personalizing the information
  - People differ in the contents and presentations they prefer while
    interacting with the Web.
- Learning about consumers and individual users
  - Knowing what the customers do and want
Web mining taxonomy

- Web content mining: discovery of information from Web contents (various
  types of data such as text, image, audio, video, hyperlinks, etc.)
- Web structure mining: discovery of the model underlying the link structures
  of the Web; the model is based on the topology of the hyperlinks, with or
  without the description of the links.
- Web usage mining: discovery of information from Web users' sessions and
  behaviors (secondary data derived from the interactions of the users while
  interacting with the Web)
Web mining taxonomy

Web mining
  - Web content mining
    - Web page content mining
    - Search result mining
  - Web structure mining
  - Web usage mining
    - General access pattern tracking
    - Customized usage tracking
Web mining taxonomy: Web page content mining

Web page content mining (Web page summarization), a branch of Web content
mining:
- WebLog (Lakshmanan et al., 1996), WebSQL (Mendelzon et al., 1998), ...:
  Web structuring query languages; can identify information within given web
  pages
- Ahoy! (Etzioni et al., 1997): uses heuristics to distinguish personal home
  pages from other web pages
- ShopBot (Etzioni et al., 1997): looks for product prices within web pages
Web mining taxonomy: Search result mining

Search result mining (search engine result summarization), a branch of Web
content mining:
- Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni,
  1997): categorizes documents using phrases in titles and snippets
Web mining taxonomy: Web structure mining

- Using links:
  - PageRank (Brin and Page, 1996), HITS (Kleinberg, 1996), CLEVER
    (Chakrabarti et al., 1998): use interconnections between web pages to give
    weight to pages
- Web communities:
  - communities crawling (Kumar et al., 1999), etc.
- Using generalization:
  - MLDB (1994), VWV (1998): uses a multi-level database representation of the
    Web; counters (popularity) and link lists are used for capturing structure
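A minimal pure-Python sketch of the PageRank idea named above: iterate the random-surfer update on a toy link graph. The damping factor, iteration count, and graph are illustrative; this is a sketch, not the production algorithm.

```python
# Minimal PageRank power iteration on a toy link graph (pure-Python sketch).
def pagerank(links, d=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages   # dangling page: spread evenly
            share = rank[p] / len(targets)
            for q in targets:
                new[q] += d * share             # pass weight along each link
        rank = new
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_web))    # ranks sum to ~1; C and A get the largest shares
```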
Web mining taxonomy: General access pattern tracking

General access pattern tracking, a branch of Web usage mining:
- Web log mining (Zaïane, Xin and Han, 1998): uses KDD techniques to
  understand general access patterns and trends
Web mining taxonomy: Customized usage tracking

Customized usage tracking, a branch of Web usage mining:
- Adaptive sites (Perkowitz and Etzioni, 1997): analyzes the access patterns
  of each user at a time; the web site restructures itself automatically by
  learning from user access patterns
Web mining process

1. Resource finding: retrieving intended Web documents
2. Information selection and pre-processing: automatically selecting and
   preprocessing specific information from retrieved Web resources
3. Generalization: automatically discovering general patterns at individual
   Web sites as well as across multiple sites
4. Analysis: validation and/or interpretation of the mined patterns
Data and knowledge visualization

[Screenshots: fisheye view, tree map, hyperbolic tree, MagicLens, cone tree]
KDD products and tools

- Salford Systems
- SPSS
- Silicon Graphics
- SAS
- IBM
- RuleQuest Research (C4.5)
Outline

1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web
   mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Challenges of KDD

- Large data sets (10^6-10^12 bytes) and high dimensionality (10^2-10^3
  attributes) [problems: efficiency, scalability?]
- Different types of data in different forms (mixed numeric, symbolic, text,
  image, voice, ...) [problems: quality, effectiveness?]
- Data and knowledge are changing
- Human-computer interaction and visualization
Large datasets and high dimensionality

- 3 binary attributes (each with 2 values):
  #instances = 2^3 = 8, #patterns = 3^3 = 27
  (in a pattern, each attribute is fixed to one of its values or left
  unspecified)
- What if the number of attributes increases? The sizes of the instance space
  and the pattern space increase exponentially: with p attributes each having
  d values, the size of the instance space is d^p.
- 38 attributes each with 10 values: #instances = 10^38
Possible solutions

- Scalable and efficient algorithms (scalable: given an amount of main memory,
  the runtime increases linearly with the number of input instances)
- Sampling (instance selection)
- Dimensionality reduction (feature selection)
- Approximation methods
- Massively parallel processing
- Integration of machine learning and database management
Numerical vs. symbolic data

Attribute types:
- Symbolic, no structure (nominal/categorical): places, color
- Symbolic, ordinal structure (ordinal): taste, rank, resemblance
- Numerical, ring structure (measurable): age, temperature, income, length

Symbolic data call for combinatorial search in hypothesis spaces (machine
learning); numerical data are often handled by matrix-based computation
(multivariate data analysis).
Mining with decision trees

- Attribute selection
- Pruning trees
- From trees to rules (high cost of pruning)
- Visualization
- Data access: recent developments on very large training sets; fast,
  efficient and scalable (in-memory and secondary storage)

(well-known systems: C4.5 and CART)
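A small sketch of these steps using scikit-learn's CART-style trees, standing in for the C4.5/CART systems named above: a depth limit plays the role of pruning, and each root-to-leaf path reads as one rule. The library and dataset are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Sketch: induce a tree on the classic iris data; the depth limit stands in
# for pruning, and export_text turns root-to-leaf paths into readable rules.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```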
Scalable decision tree induction methods

- SLIQ (Mehta et al., 1996): builds an index for each attribute; only the
  class list and the current attribute list reside in memory
- SPRINT (J. Shafer et al., 1996): constructs an attribute list data structure
- PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning:
  stops growing the tree earlier
- RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability
  aspects from the criteria that determine the quality of the tree; builds an
  AVC-list (attribute, value, class label)
Mining with neural networks

- Neural networks effectively address the weakness of the symbolic AI approach
  in knowledge discovery (growth of the hypothesis space).
- Extracting or making sense of the numeric weights associated with the
  interconnections of neurons, to come up with a higher level of knowledge,
  has been and will continue to be a challenging problem.
Mining with association rules

- Improving the efficiency
  - Database scan reduction: partitioning (Savasere 95), hashing (Park 95),
    sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding
    non-redundant rules (3000 times fewer, Zaki KDD'2000)
  - Parallel mining of association rules
- New measures of association
  - Interestingness and exceptional rules
  - Generalized and multiple-level rules
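For concreteness, a minimal Apriori-style sketch: one pass over the transactions per itemset size, pruning any candidate that has an infrequent subset. The data are toy; real implementations add the scan-reduction tricks listed above.

```python
from itertools import combinations

# Minimal Apriori sketch: one database scan per itemset size, with the classic
# pruning that any superset of an infrequent itemset must be infrequent.
def apriori(transactions, min_support):
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets, k = {}, [frozenset([i]) for i in items], 1
    while k_sets:
        # one scan: count the support of all candidates of size k
        counts = {c: sum(c <= t for t in transactions) for c in k_sets}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # generate size k+1 candidates whose size-k subsets all survived
        atoms = sorted({i for c in survivors for i in c})
        k += 1
        k_sets = [frozenset(c) for c in combinations(atoms, k)
                  if all(frozenset(s) in survivors
                         for s in combinations(c, k - 1))]
    return frequent

baskets = [("milk", "bread"), ("milk", "beer"),
           ("milk", "bread", "beer"), ("bread",)]
for itemset, support in apriori(baskets, min_support=2).items():
    print(set(itemset), support)
# {'milk'} 3, {'bread'} 3, {'beer'} 2, {'milk','beer'} 2, {'milk','bread'} 2
```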
Mining scientific data

- Data mining in bioinformatics
- Data mining in astronomy and earth sciences
- Mining physics and chemistry data
- Mining large image databases
- etc.
Solutions to mining huge datasets

- Scalable and efficient algorithms (scalable: given an amount of main memory,
  the runtime increases linearly with the number of input instances)
- Massively parallel processing
  - data-parallel vs. control-parallel data mining
  - client/server frameworks for parallel data mining

(Mining Very Large Databases With Parallel Processing, Alex A. Freitas &
Simon H. Lavington, Kluwer Academic Publishers, 1998)
Example of a scalable algorithm

- Mixed similarity measures (MSM):
  - Goodall (1966): time O(n^3); Diday and Gowda (1992); Ichino and Yaguchi
    (1994)
  - Li & Biswas (1997): time O(n^2 log n^2), space O(n^2), computing P̂*_ij
- New and efficient MSM (Binh & Bao, 2000):
  - time and space O(n), computing P̂_ij = 1 - P̂*_ij
Comparative results

US Census database, 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz,
RAM 2 GB, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

  #cases (size)    #values  LiBis time      Ours time  LiBis mem  Ours mem  Preproc.
                            O(n^2 log n^2)  O(n)       O(n^2)     O(n)
  500     (0.2M)       497  67.3s           0.1s       5.3M       0.5M      0.1s
  1,000   (0.5M)       992  26m6.2s         0.2s       20.0M      0.7M      0.1s
  1,500   (0.9M)     1,486  1h46m31s        0.3s       44.0M      0.9M      0.2s
  2,000   (1.1M)     1,973  6h59m45s        0.5s       77.0M      1.1M      0.5s
  5,000   (2.6M)     4,858  >60h            2.8s       455.0M     2.1M      0.9s
  10,000  (5.2M)     9,651  not app.        9.2s       not app.   3.4M      6.2s
  199,523 (102M)    97,799  not app.        36m26s     not app.   64.0M     127.2s
High performance computing for NLP

- Experimental environment:
  - massively parallel computer (Cray XT3)
  - Linux OS, using C/C++ and the MPI library
- Experiments on noun phrase chunking:
  - cross-validation (CV) on 25 sections of the Penn TreeBank: about 1,000,000
    words (more than 40,000 English sentences)
  - highest F1 score: 96.32%
- On a single CPU (estimated): 2183.25 minutes (~36.38 hours) per fold, i.e.,
  909.69 hours (~37.9 days) for all 25 folds of the CV test
- On 45 parallel processes of the Cray XT3 system: 51.5 minutes per fold,
  i.e., 21.47 hours (~1 day) for all 25 folds of the CV test

  Method    Precision  Recall   F1-measure
  [KM01]    95.62%     95.93%   95.77%
  [TKS00]   95.04%     94.75%   94.90%
  [TV99]    93.71%     93.90%   93.81%
  [RM95]    93.10%     93.50%   93.30%
  Ours      96.42%     96.23%   96.32%
Outline

1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web
   mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Background

- HBV and HCV are viruses which both cause continuous inflammation of the
  liver (chronic hepatitis).
- The inflammation results in liver fibrosis, and finally liver cirrhosis (LC)
  after 20 to 30 years, which is diagnosed by liver biopsy.
- In addition, cirrhosis patients have a high potential risk of hepatocellular
  carcinoma (HCC).
- Physicians can treat viral hepatitis with interferon (IFN); however, IFN is
  not always effective and has severe side effects.
The natural course of hepatitis

[Figure: fibrosis stage (F0-F4) over time from the onset of infection;
hepatitis progresses to LC and then HCC over 20-30 years. Open question:
the course of HCC?]
The effect of interferon therapy

[Figure: fibrosis stage (F0-F4) over time from the onset of infection; IFN
therapy can halt the progression towards LC and HCC. Open question: the
effectiveness of interferon?]
Time-series data

- Source: First Department of Internal Medicine, Chiba University School of
  Medicine
- Data on about 800 patients over 20 years
- Characteristics of the data:
  - large-scale, uncleaned time-series data
  - a very large number of examination items
  - the values and precision of each examination item differ with the
    examination period; many missing values
  - bias introduced by physicians exists
Example of the hepatitis dataset

Sequences of length 179 for patient MID1 and 88 for patient MID2; the
sequences are irregularly sampled.
Problems under consideration

P1. What are the differences in temporal patterns between hepatitis B and C
    (HBV, HCV)?
P2. Can laboratory examinations be used to estimate the stage of liver
    fibrosis (F0, F1, F2, F3, F4)?
P3. Is the interferon therapy effective or not (response, partial response,
    aggravation, no response)?
Why data abstraction?

- Medical knowledge is usually expressed in the form of symbolic statements
  that are as general as possible.
- Patient data mostly comprise numeric measurements of various parameters at
  different points in time.
- To perform any kind of medical problem solving, patient data have to be
  "matched" against medical knowledge.
What is temporal abstraction?

- Idea: convert time-stamped data points to a symbolic, interval-based
  representation.
- Characteristic: no detail, but the essence of the trend and the state
  changes of patients.
- Example: the pattern "ZTT: H>NS" says that ZTT first was increasingly high,
  then changed to the normal region and stayed stable.
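A minimal sketch of this state-abstraction idea: map each measurement to a symbolic state relative to an assumed normal range, then collapse runs of equal states into intervals. The ZTT values and the normal range below are invented for illustration, not taken from the hepatitis dataset.

```python
from itertools import groupby

# Sketch of the abstraction idea: time-stamped values -> symbolic states
# (L/N/H relative to an assumed normal range) -> interval runs like "H>NS".
def abstract_states(values, normal_low, normal_high):
    def state(v):
        return "L" if v < normal_low else "H" if v > normal_high else "N"
    # collapse consecutive identical states into (state, run-length) intervals
    return [(s, len(list(run))) for s, run in groupby(state(v) for v in values)]

# Toy ZTT-like sequence: first high, then back inside the normal region.
ztt = [18.0, 19.5, 21.0, 11.0, 10.5, 10.8, 11.2]
print(abstract_states(ztt, normal_low=4.0, normal_high=12.0))
# [('H', 3), ('N', 4)]  ~ high, then normal and stable: the "H>NS" pattern
```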
Research objectives

1. Develop temporal abstraction methods for describing the hepatitis data
   appropriately for each problem.
2. Use data mining methods to solve the problems from the abstracted data.
Key issues in temporal abstraction

1. Define a description language for abstraction patterns (e.g. "ZTT: H>NS").
   Requirement: simple but rich enough to describe abstract patterns.
2. Determine the basic abstraction patterns.
   Requirement: typical and significant primitives needed for the analysis
   purpose.
3. Transform each sequence into temporal abstraction patterns.
   Requirement: efficiently characterize the trends and changes in the
   temporal data.
Typical tests by physicians

- Short-term changed tests
  - Up: GPT, GOT, TTT, ZTT
  - concern inflammation; change quickly, in days or weeks
  - can be much higher (even 40 times) than the normal range, with many peaks
- Long-term changed tests
  - Down: T-CHO, CHE, ALB, TP (liver products), PLT, WBC, HGB
  - Up: T-BIL, D-BIL, I-BIL, AMONIA, ICG-15
  - concern liver status; change slowly, in months or years
  - do not much exceed the normal range
Observation of temporal sequences

- We made a tool in Matlab to visualize temporal sequences.
- We observed a large number of temporal sequences for different patients and
  tests.
Ideas of basic patterns

- Short-term changed tests (up: GPT, GOT, TTT, ZTT)
  - Idea: base state and peaks
- Long-term changed tests (down: T-CHO, CHE, ALB, TP (liver products), PLT,
  WBC, HGB; up: T-BIL, D-BIL, I-BIL, AMONIA, ICG-15)
  - Idea: change of states (compactly capture both the state and the trend of
    the sequence)
Temporal abstraction approach

Using a discovery algorithm, the change characteristics of each test item are
classified into "long-term changed" and "short-term changed" items:
- 21 patterns for long-term changed items
- 8 main patterns for short-term changed items
Two temporal abstraction methods

- Abstraction pattern extraction (APE, since 2001): mapping each given
  temporal sequence of fixed length into one of the pre-defined temporal
  patterns
- Temporal relation extraction (TRE, since 2004): detecting temporal relations
  between basic patterns, and extracting rules using temporal relations
Data and knowledge visualization

- Simultaneously view the data in different forms: top-left is the original
  data; top-right is a histogram of attributes; lower-left is a view by
  parallel coordinates; lower-right shows relations between a conjunction of
  attribute-value pairs and the class labels.
- View an individual rule in D2MS: the top-left window shows the list of
  discovered rules, the middle-left and top-right windows show a rule under
  inspection, and the bottom window displays the instances covered by that
  rule.
LC vs. non-LC
Effectiveness of interferon (LUPC rules)

- GOT and GPT occurred as VH or EH in no_response rules.
- CHE occurred as N/L or L-D in partial_response rules.
- D-BIL occurred as N/H or H>N in no_response and partial_response rules.
Temporal relations

- Relations between two basic patterns, each happening in a period of time,
  e.g., "ALB gets down right after heavy inflammation finished".
- Updating a graph of temporal relations (Allen, 1983) and temporal logic
  (Allen, 1984).

Allen's interval relations:
- A is before B / B is after A
- A meets B / B is met by A
- A overlaps B / B is overlapped by A
- A starts B / B is started by A
- A finishes B / B is finished by A
- A is during B / B contains A
- A is equal to B / B is equal to A
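The seven basic relations and their inverses are easy to compute for a pair of intervals; the sketch below classifies two (start, end) intervals into Allen's relations. The interval endpoints in the example are illustrative.

```python
# Sketch: classify the Allen (1983) relation between two closed intervals,
# each given as (start, end); returns the relation of A with respect to B.
def allen_relation(a, b):
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:  return "before"
    if b2 < a1:  return "after"
    if a2 == b1: return "meets"
    if b2 == a1: return "met-by"
    if (a1, a2) == (b1, b2): return "equal"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

# e.g. "ALB goes down right AFTER heavy inflammation finished":
inflammation, alb_low = (0, 10), (11, 20)
print(allen_relation(alb_low, inflammation))   # -> 'after'
```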
Findings for HBV and HCV

- Findings differ from general medical observations (no clear distinction
  between type B and C):
  - R#13 (HBV): "ALP changed from low to normal state" AFTER "LDH changed from
    low to normal state" (supp. count = 21, conf. = 0.71)
  - R#5 (HCV): "ALP changed from normal to high state" AFTER "LDH changed from
    normal to low state" (supp. count = 60, conf. = 0.80)
- "Quantitatively" confirm findings in medicine (Medline abstracts):
  - R#53 (HCV): "ALB changed from normal to low state" BEFORE "TTT in high
    state with peaks" AND "ALP from normal to high state" BEFORE "TTT in high
    state with peaks" (supp. count = 10, conf. = 1.00)
  - Murawaki et al. (2001): the main difference between HBV and HCV is that
    the base state of TTT in HBV is normal, while that of HCV is high.
Findings for LC and non-LC

- Typical relation in non-LC rules: "GOT in high or very high states with
  peaks" BEFORE "TTT in high state with peaks" (20 rules contain this temporal
  relation)
- Typical relation in LC rules: "GOT in high or very high states with peaks"
  AFTER "TTT in high or very high states with peaks" (10 rules contain this
  temporal relation)
Medical data mining from multiple information sources

A combined approach to hepatitis patient history data:
- data mining
- text mining from Medline
- evaluation and suggestions by domain experts
(project period: 2004-2007)
Conclusion

- Temporal abstraction has been shown to be a good alternative approach to the
  hepatitis study.
- The results are comprehensible to physicians, and significant rules were
  found.
- Much work remains to be done for solving the three problems, including:
  - integrating qualitative and quantitative information
  - combining with text mining techniques
  - incorporating domain expert knowledge
Summary

- KDD is motivated by the wide availability of huge amounts of data and the
  imminent need for turning such data into useful information and knowledge.
- There are different methods to find predictive or descriptive models that
  strongly depend on the data schemes.
- The KDD process is necessarily interactive and iterative, and requires human
  participation in all steps of the process.
Recommended references

- http://www.kdnuggets.com
- David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining,
  MIT Press, 2000
- Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan
  Kaufmann, 2000
- U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances
  in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
Acknowledgments

- Profs. S. Ohsuga, M. Kimura, H. Motoda, H. Shimodaira, Y. Nakamori,
  S. Horiguchi, T. Mitani, T. Tsuji, K. Satou, among others
- Nguyen N.B., Nguyen T.D., Kawasaki S., Huynh V.N., Dam H.C., Tran T.N.,
  Pham T.H., Nguyen D.D., Nguyen L.M., Phan X.H., Le M.H., Hassine B.A.,
  Le S.Q., Zhang H., Nguyen C.H., Nagai K., Nguyen C.H., Nguyen T.P.,
  Tran D.H.