Theses
Elena Baralis, Silvia Chiusano, Paolo Garza, Tania Cerquitelli, Giulia Bruno, Daniele Apiletti, Alessandro Fiori, Luca Cagliero, Alberto Grand, Luigi Grimaudo
Turin, January 2011

Data Mining Algorithms

Disk-based algorithms to support data mining activities (1)
• Association rule extraction
  • Frequent itemset extraction -> computationally intensive
  • Association rule generation from frequent itemsets
• Most algorithms exploit ad-hoc main memory data structures to efficiently extract itemsets from a flat file
• To support the extraction process on large datasets, disk-based extraction algorithms need to be exploited
• Disk-based structures and disk-based algorithms to efficiently support itemset mining
DB MG Tania Cerquitelli

Disk-based algorithms to support data mining activities (2)
• Clustering algorithms
  • Discover groups of correlated objects that share similar properties
• Most algorithms exploit ad-hoc main memory data structures to efficiently discover clusters
• To support clustering sessions on large datasets, disk-based extraction algorithms need to be exploited
• Disk-based structures and disk-based algorithms to efficiently support clustering algorithms
Tania Cerquitelli

An optimizer to support data mining activities
• Association rule extraction
  • Frequent itemset extraction -> computationally intensive
  • Association rule generation from frequent itemsets
• Research activity usually focuses on defining efficient algorithms for itemset extraction
  • Different algorithms are suitable for different data distributions
  • Some algorithms have been integrated into an open-source DBMS kernel
• Design and develop a module (i.e., an optimizer), possibly integrated into an open-source DBMS kernel (e.g., PostgreSQL), that selects for each mining process the best algorithm for the current data distribution
Tania Cerquitelli

Disk-based algorithms to support text mining
• Huge amount of textual data
• Most algorithms exploit ad-hoc main memory data structures to efficiently perform text mining
  • These approaches rely on the available physical memory and may run out of memory when the analysis is performed on very large databases
• Design new disk-based structures that allow a compact representation of very large datasets and efficiently support data mining algorithms
• Text mining by exploiting different data mining techniques (e.g., clustering, association rules)
Tania Cerquitelli, Alessandro Fiori, Alberto Grand

Generalized rule mining with constraints
• Generalized rules aim at identifying hidden correlations among data at different granularity levels
  • Usage of taxonomies for data aggregation
  • High number of mined rules -> high complexity
• Constraints restrict the extracted knowledge to a subset of interest
• Study and implementation of generalized association rule mining algorithms with constraints
Luca Cagliero

Bayesian classification by means of generalized rules
• Generalized rules aim at identifying hidden correlations among data at different granularity levels
  • Usage of taxonomies for data aggregation
• Bayesian classification exploits a probabilistic model to predict the class of a test instance
• Study and implementation of a Bayesian classifier exploiting generalized association rules
Luca Cagliero

Dynamic data mining
• Analysis and comparison of the information extracted by different data mining and knowledge discovery sessions scheduled over time
• Generalized rules aim at identifying hidden correlations among data at different granularity levels
  • Usage of taxonomies for data aggregation
• Extraction and analysis of dynamic generalized association rules
Luca Cagliero

Time Series Classification
• Time series: sequence of real values
• Multivariate time series: each data item is a set of <attribute: time series> pairs
• Data arising in broad areas (e.g., medicine, finance, multimedia, etc.)
• Development of algorithms for
  • Selection of the most discriminant attributes
  • Classification of new data
Tania Cerquitelli

Database systems

Distributed databases
• Challenge
  • Scalability and reliability of web applications delivering
    • social network interactions
    • check-in to physical (real) places
    • sharing of complex data (like, comment, photo, and video)
  • Examples: Facebook, Twitter, and Foursquare grew by 1000% in a short time
• Solution
  • Horizontal scalability
    • you can't add more resources to a single main DB
    • add new "small" DBs in a network of distributed DBs
  • Document-based DBs
    • exploit the friendly approach of non-relational DBs
    • easy built-in replication management and high performance
• Study the potential of distributed DBs and non-relational DBs
• References: mongodb.org, http://goo.gl/6L2yC
Daniele Apiletti

A tool for database conceptual model design
• Relational databases are designed by means of the Entity-Relationship (ER) model
• Few graphical tools are currently available for conceptual model design by means of ER models
  • GNU Ferret (http://www.gnuferret.org/) provides a limited set of functionalities
• Design and implementation of a new tool for ER conceptual model design
Silvia Chiusano

Text Mining

Summarization
• Summarization of documents
• Applications
  • identification of relevant knowledge from news, research articles, blogs
  • clustering of sentences with similar and interesting content
  • biological knowledge extraction from different texts
  • validation of experimental results according to the application domain
• Development of new summarization approaches according to the information of interest
  • enhance data representation to speed up the summarization process
  • results presentation oriented to user queries
  • integration of topic detection algorithms
• Information retrieval, text mining, summarization, clustering
Alessandro Fiori

Ontology inference
• Ontology: a rigorous and exhaustive organization of some knowledge domain
  • a hierarchical structure represents relevant entities and their relationships
• Text mining for ontology inference
  • identify concepts by means of entity recognition approaches
  • extraction of relationships between entities
• Examples: DBPedia, YAGO
• Applications
  • discovering relationships among domain entities from news, research articles, blogs, etc.
  • validating relationships of general-purpose ontologies
• Entity recognition, association rules, text mining
Luca Cagliero, Alessandro Fiori, Alberto Grand

Social networks
• Infer knowledge from user-generated content
• Applications
  • extraction of relevant information from social network sites
  • personalization of web crawlers using user profiles
  • identification of news, locations, etc.
• User behavior analysis by means of association rule mining
  • summarization approaches to identify relevant information
  • classification of web objects using user-generated content
  • clustering of web pages according to the topic
  • development of recommendation systems using user behavior on social networks
• Entity recognition, clustering, association rules, text mining
Luca Cagliero, Alessandro Fiori

Mining in Specific Application Domains

Queries on sensor networks
• "The sensor network is a database"
• Querying the network
  • acquiring (possibly aggregate) measurements describing the state of the monitored environment
• [Figure labels: App; Query, Trigger; Data; TinyDB; Sensor network]
• Challenge: data mining techniques to learn correlated attributes
  • which sensors/measurements are correlated?
  • how strong is the correlation? (sensor data are generally highly correlated)
  • when are sensors/measurements correlated? (e.g., from 8:00 a.m. to 11:00 a.m.)
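
The three correlation questions (which, how strong, when) can be illustrated with a small sketch. It assumes that all sensors are sampled at the same hours and uses the Pearson coefficient as the correlation measure; the slides do not prescribe a measure, and the sensor names, readings, and the helper names `pearson` and `correlated_pairs` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Constant series: the coefficient is undefined, reported as 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(readings, start_hour, end_hour, threshold=0.9):
    """Return (sensor_a, sensor_b, r) for pairs with |r| >= threshold,
    using only the hours h with start_hour <= h < end_hour."""
    result = []
    for a, b in combinations(sorted(readings), 2):
        hours = sorted(
            h for h in readings[a]
            if h in readings[b] and start_hour <= h < end_hour
        )
        if len(hours) < 2:
            continue  # not enough shared samples in the window
        r = pearson([readings[a][h] for h in hours],
                    [readings[b][h] for h in hours])
        if abs(r) >= threshold:
            result.append((a, b, r))
    return result


# Hypothetical readings: sensor name -> {hour of day: measured value}
readings = {
    "temp_1": {8: 20.0, 9: 21.0, 10: 22.0, 14: 30.0},
    "temp_2": {8: 20.5, 9: 21.5, 10: 22.5, 14: 25.0},
    "light": {8: 300.0, 9: 100.0, 10: 250.0, 14: 900.0},
}

for a, b, r in correlated_pairs(readings, 8, 11):
    print(f"{a} ~ {b} between 8:00 and 11:00, r = {r:.2f}")
```

Changing the time window answers the "when" question; the threshold controls how strong a correlation must be before a pair is reported.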
Tania Cerquitelli

Wireless network traffic analysis
• Security, wireless network design, wireless network resource allocation
• Wireless network traffic analysis by means of data mining algorithms
  • Characterize the traffic profile and detect Internet security threats
  • Association rules
  • Clustering algorithms
Tania Cerquitelli

Medical data analysis
• Analysis of patients' exam log databases containing patients' historical data
• Aims
  • extraction of the most frequent sequences
  • extraction of medical patterns for specific diseases
  • exploiting a compact representation of sets of sequences to allow easier validation
• Thesis: implementation of algorithms to extract frequent sequences, with particular attention to temporal constraints and exam ontologies
Giulia Bruno

Gene clustering validation
• By analysing gene expression data it is possible to cluster genes based on their behaviour in different experimental conditions
• The validation of results is critical for two reasons
  • lack of benchmark datasets
  • choice of the right quality index
• Thesis: development of clustering algorithms and evaluation of quality indices to analyze gene expression data
Giulia Bruno, Alessandro Fiori

Biological and clinical data integration
• In the personalized medicine field it is important to integrate heterogeneous medical data (clinical and molecular)
  • heterogeneous data management
  • detection of correlations among experiments
• Thesis: study and modeling of a database/data warehouse to integrate clinical and molecular data, evaluation of real systems (caBIG), study of physical structures for performance improvement, graphical interfaces for data access
Giulia Bruno, Alessandro Fiori

Sports data analysis
• Physiological data analysis
  • Assessing athletes' improvement
  • Assessing blood lactate concentration
  • Improving the effectiveness of training sessions
• Knowledge discovery
  • Profile definition for each athlete (e.g., training heart rate)
  • Classification of athletes
Tania Cerquitelli

News analysis
• Studies
  • Query expansion techniques to reduce the query/document mismatch
  • Collaborative filtering, based on the idea that groups of similar users share similar contents
  • Content-based filtering, based on the idea that groups of similar contents are shared by the same user
  • Hybrid filtering, based on the combination of the previous approaches
  • New story detection: discovering new stories in a flow of news (breaking news)
  • Topic detection and linking: discovering news on the same topic in a flow of news, and relationships among news
  • Topic tracking: discovering future news related to events of interest to the user
  • Automatic highlight detection in the context of sport events
Alessandro Fiori

External internship
• www.ooros.com
• Web and mobile apps with social network interactions (Facebook, Twitter, Foursquare, LinkedIn, ...)
  • Data mining techniques to analyze user interactions (both basic, i.e., "like" and "comment", and on games, contests, etc.)
• Web and mobile apps exploiting the user geo-location (e.g., Facebook Places, Foursquare, and Gowalla check-in)
  • spatial data analysis (e.g., "my nearest friend")
  • spatial database indexes
• Mobile apps (Android, iPhone, etc.) with offline replication
  • handling flaky mobile connections by means of a local DB which syncs with remote DBs
Elena Baralis, Daniele Apiletti
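
As an illustration of the "my nearest friend" query mentioned for the internship, the following minimal sketch scans all friends and picks the one at the smallest great-circle (haversine) distance. The friend names, coordinates, and function names are hypothetical, and the linear scan is an assumption made for brevity: on large datasets a spatial database index, as listed in the proposal, would take its place.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_friend(user_pos, friends):
    """Return the name of the friend closest to user_pos.

    friends maps a name to a (lat, lon) tuple; raises ValueError if it is empty.
    """
    return min(friends, key=lambda name: haversine_km(*user_pos, *friends[name]))


# Hypothetical last check-in positions (latitude, longitude in degrees)
user = (45.0703, 7.6869)           # Turin
friends = {
    "alice": (45.4642, 9.1900),    # Milan
    "bob": (41.9028, 12.4964),     # Rome
    "carla": (48.8566, 2.3522),    # Paris
}

closest = nearest_friend(user, friends)
print(closest, round(haversine_km(*user, *friends[closest])), "km")
```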