Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
DB MG Data mining fundamentals DataBase and Data Mining Group of Politecnico di Torino Data analysis Most companies own huge databases containing operational data textual documents experiment results Data mining fundamentals DB MG These databases are a potential source of useful information Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino DB MG Data analysis Data mining Information is “hidden” in huge datasets not immediately evident human analysts need a large amount of time for the analysis most data is never analyzed at all The Data Gap information from available data Extraction is automatic 2,000,000 Disk space (TB) since 1995 1,500,000 1,000,000 0 1995 1996 1997 1998 1999 From R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications” Extracted information is represented by means of abstract models 3 DB MG Example: profiling Query keywords, searched topics and objects Integration of different dimensions E.g., location, time of the day, user interest Text mining 5 User registration, fidelity cards Context-aware data analysis Localization, interesting locations for users DB MG Correlated objects for cross selling Facebook, google+ profiles Dynamic data: posts on blogs, FB, tweets Maps and georeferenced data Recommendation systems Advertisements Market basket analysis Social network data User/service profiling Selected products, requested information, … Search engines and portals 4 Example: profiling Consumer behavior in e-commerce sites denoted as pattern Analyst number 500,000 performed by appropriate algorithms 2,500,000 implicit previously unknown potentially useful 3,000,000 DB MG Non trivial extraction of 4,000,000 3,500,000 2 DB MG Brand reputation, sentiment analysis, topic trends 6 Elena Baralis Politecnico di Torino 1 DB MG Data mining fundamentals DataBase and Data Mining Group of Politecnico di Torino Example: biological data Clinical analysis detecting the causes of a pathology monitoring the effect of a therapy ⇒ diagnosis improvement and definition of new specific therapies expression level of genes in a cellular tissue various types (mRNA, DNA) Patient clinical records CLID personal and demographic data exam results Microarray Biological analysis objectives PATIENT shx013: shv060: shq077: shx009: shx014: shq082: ID 49A34 45A9 52A28 4A34 61A31 99A6 IMAGE:740604 ISG20 || interferon-1.02 stimulated-2.34 gene 20kDa 1.44 0.57 -0.13 0.12 IMAGE:767176 TNFSF13 || tumor-0.52 necrosis -4.06 factor (ligand) -0.29superfamily, 0.71member1.03 13 -0.67 IMAGE:366315 LOC93343 || **hypothetical -0.25 -4.08 protein BC011840 0.06 0.13 0.08 0.06 IMAGE:235135 ITGA4 || integrin,-1.375 alpha 4 (antigen -1.605 CD49D, 0.155alpha -0.015 4 subunit of0.035 VLA-4 receptor) -0.035 shq083: 46A15 0.34 0.22 -0.08 0.505 shx008: 41A31 -0.51 -0.09 -0.05 -0.865 Bio-discovery Textual data in public collections heterogeneous formats, different objectives scientific literature (PUBMed) ontologies (Gene Ontology) Pharmacogenesis DB MG 7 gene network discovery analysis of multifactorial genetic pathologies lab design of new drugs for genic therapies DB MG Knowledge Discovery Process 8 Preprocessing data cleaning selection preprocessing preprocessing data integration transformation data selected data preprocessed data transformed data selected data preprocessed data data mining interpretation Without good quality data, no good quality pattern knowledge DB MG 10 DB MG Data mining origins statistics, artificial intelligence (AI) pattern recognition, machine learning Statistics, database systems AI Traditional techniques are not appropriate because of significant data volume large data dimensionality heterogeneous and distributed nature of data DB MG 11 Analysis techniques Draws from • reconciles data extracted from different sources • integrates metadata • identifies and solves data value conflicts • manages redundancy Real world data is “dirty” pattern KDD = Knowledge Discovery from Data • reduces the effect of noise • identifies or removes outliers • solves inconsistencies Descriptive methods Machine Learning, Pattern Recognition Predictive methods Data Mining Database systems From: P. Tan, M. Steinbach, V. Kumar, “Introduction to Data Mining” 12 Extract interpretable models describing data Example: client segmentation DB MG Exploit some known variables to predict unknown or future values of (other) variables Example: “spam” email detection 13 Elena Baralis Politecnico di Torino 2 DB MG Data mining fundamentals DataBase and Data Mining Group of Politecnico di Torino Classification Classification • Approaches Objectives training data training data model classified data DB MG 14 DB MG 15 Classification • Requirements Applications – accuracy – interpretability – scalability – noise and outlier management training data detection of customer propension to leave a company (churn or attrition) fraud detection classification of different pathology types … dati di training model unclassified data DB MG modello classified data dati classificati dati non classificati 16 DB MG Clustering 17 Clustering • Approaches Objectives classified data unclassified data Classification decision trees bayesian classification classification rules neural networks k-nearest neighbours SVM model unclassified data – – – – – – prediction of a class label definition of an interpretable model of a given phenomenon – partitional (K-means) detecting groups of similar data objects identifying exceptions and outliers – hierarchical – density-based (DBSCAN) – SOM • Requirements – scalability – management of – noise and outliers – large dimensionality – interpretability DB MG 18 DB MG 19 Elena Baralis Politecnico di Torino 3 DB MG Data mining fundamentals DataBase and Data Mining Group of Politecnico di Torino Association rules Clustering Applications Objective customer segmentation clustering of documents containing similar information grouping genes with similar expression pattern … Tickets at a supermarket counter TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diapers, Milk 4 Beer, Bread, Diapers, Milk 5 Coke, Diapers, Milk … DB MG 20 market basket analysis cross-selling shop layout or catalogue design Tickets at a supermarket counter TID Items 1 Bread, Coca Cola, Milk 2 Beer, Bread Beer, Coca Cola, Diapers, Milk 4 Beer, Bread, Diapers, Milk 5 Coca Cola, Diapers, Milk … 2% of transactions contains both items 30% of transactions containing diapers also contain beer 21 Sequence mining 2% of transactions contains both items 30% of transactions containing diapers also contain beer temporal and spatial information are considered example: sensor network data Regression ordering criteria on analyzed data are taken into account example: motif detection in proteins Time series and geospatial data Association rule diapers ⇒ beer 3 … Association rule diapers ⇒ beer Other data mining techniques Applications DB MG Association rules extraction of frequent correlations or pattern from a transactional database prediction of a continuous value example: prediction of stock quotes Sensor network Outlier detection example: intrusion detection in network traffic analysis … DB MG 22 DB MG 23 Open issues Scalability to huge data volumes Data dimensionality Complex data structures, heterogeneous data formats Data quality Privacy preservation Streaming data DB MG 24 Elena Baralis Politecnico di Torino 4