Theses: Data Mining Algorithms - DataBase and Data Mining Group

Theses
Elena Baralis, Silvia Chiusano, Paolo Garza, Tania Cerquitelli,
Giulia Bruno, Daniele Apiletti, Alessandro Fiori, Luca Cagliero,
Alberto Grand, Luigi Grimaudo
Turin, January, 2011
Data Mining Algorithms
Disk-based algorithms to support data mining activities (1)

Association rule extraction
  Frequent itemset extraction -> computationally intensive
  Association rule generation from frequent itemsets
Most algorithms exploit ad-hoc main-memory data structures to efficiently extract itemsets from a flat file
To support the extraction process on large datasets, disk-based extraction algorithms need to be exploited
Disk-based structures and disk-based algorithms to efficiently support itemset mining
Tania Cerquitelli
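
The two steps named on this slide (frequent itemset extraction, then rule generation) can be illustrated with a minimal in-memory sketch. It is a level-wise, Apriori-style search chosen only for brevity; it is the main-memory baseline the slide refers to, not the disk-based approach proposed as thesis work. The function names, the example transactions, and the thresholds are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every distinct item as a 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
    return frequent

def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples from frequent itemsets."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support count is already in the dictionary.
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=2)
rules = association_rules(frequent, min_confidence=0.9)
```

Every pass rescans the whole transaction list, which is why the extraction step is the expensive one and why the whole dataset is assumed to fit in memory here.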
Disk-based algorithms to support data mining activities (2)

Clustering algorithms
  Discover groups of correlated objects that share similar properties
Most algorithms exploit ad-hoc main-memory data structures to efficiently discover clusters
To support clustering sessions on large datasets, disk-based algorithms need to be exploited
Disk-based structures and disk-based algorithms to efficiently support clustering algorithms
Tania Cerquitelli
An optimizer to support data mining activities

Association rule extraction
  Frequent itemset extraction -> computationally intensive
  Association rule generation from frequent itemsets
Research activity usually focuses on defining efficient algorithms for itemset extraction
Different algorithms are suitable for different data distributions
Some algorithms have been integrated into an open-source DBMS kernel
Design and develop a module (i.e., an optimizer), possibly integrated into an open-source DBMS kernel (e.g., PostgreSQL), which selects the best algorithm for each mining process according to the current data distribution
Tania Cerquitelli
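
A rough sketch of what such an optimizer might decide, under strong assumptions: the data distribution is summarized by a single density measure, and the choice is between two placeholder strategy labels. The density definition, the threshold value, and both labels are hypothetical; the slide does not say which statistics or which algorithms the module should use.

```python
def dataset_density(transactions):
    """Average transaction length divided by the number of distinct items."""
    items = {item for t in transactions for item in t}
    if not transactions or not items:
        return 0.0
    avg_len = sum(len(set(t)) for t in transactions) / len(transactions)
    return avg_len / len(items)

def choose_algorithm(transactions, dense_threshold=0.3):
    """Return a strategy label; labels and threshold are illustrative only."""
    if dataset_density(transactions) >= dense_threshold:
        return "prefix-tree-based"  # hypothetical choice for dense data
    return "level-wise"             # hypothetical choice for sparse data

dense = [{"a", "b", "c"}, {"a", "b", "c"}]
sparse = [{"a"}, {"b"}, {"c"}, {"d"}]
```

A real optimizer would presumably calibrate such a rule against measured running times; a single threshold is only the simplest possible decision function.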
Disk-based algorithms to support text mining

Huge amount of textual data
Most algorithms exploit ad-hoc main-memory data structures to efficiently perform text mining
  These approaches rely on the available physical memory and may run out of memory when the analysis is performed on very large databases
Design new disk-based structures that allow the compact representation of very large datasets and efficiently support data mining algorithms
Text mining by exploiting different data mining techniques (e.g., clustering, association rules)
Tania Cerquitelli, Alessandro Fiori, Alberto Grand
Generalized rule mining with constraints

Generalized rules aim at identifying hidden correlations among data at different granularity levels
  -> Usage of taxonomies for data aggregation
High number of mined rules -> high complexity
Constraints restrict the extracted knowledge to a subset of interest
Study and implementation of generalized association rule mining algorithms with constraints
Luca Cagliero
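
The role of the taxonomy can be sketched as follows: each transaction is extended with the ancestors of its items, so that support can be counted for itemsets that mix granularity levels. This is one common way to handle generalized itemsets, shown here as an assumption rather than as the method the thesis would adopt; the taxonomy, the transactions, and the function names are made up, and the taxonomy is assumed to be a tree (one parent per item, no cycles).

```python
def ancestors(item, taxonomy):
    """Walk child -> parent links up to the root."""
    result = []
    while item in taxonomy:
        item = taxonomy[item]
        result.append(item)
    return result

def extend_transaction(transaction, taxonomy):
    """Add every ancestor of every item, so higher-level itemsets can be counted."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, taxonomy))
    return extended

def support(itemset, transactions, taxonomy):
    """Fraction of (extended) transactions that contain the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= extend_transaction(t, taxonomy))
    return hits / len(transactions)

# Hypothetical taxonomy, stored as child -> parent.
taxonomy = {
    "jacket": "outerwear",
    "coat": "outerwear",
    "outerwear": "clothes",
    "boots": "footwear",
}
transactions = [{"jacket", "boots"}, {"coat", "boots"}, {"jacket"}, {"boots"}]
```

Here {jacket, boots} has support 0.25 while the generalized itemset {outerwear, boots} has support 0.5, which is the kind of correlation that only appears at a higher granularity level. Extending transactions also multiplies the candidate itemsets, in line with the slide's remark on the high number of mined rules.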
Bayesian Classification by means of Generalized Rules

Generalized rules aim at identifying hidden correlations among data at different granularity levels
  -> Usage of taxonomies for data aggregation
Bayesian classification exploits a probabilistic model to predict the class of a test instance
Study and implementation of a Bayesian classifier exploiting generalized association rules
Luca Cagliero
Dynamic data mining
Analysis and comparison of the information extracted by different data mining and knowledge discovery sessions scheduled over time.

Generalized rules aim at identifying hidden correlations among data at different granularity levels
  -> Usage of taxonomies for data aggregation

Extraction and analysis of dynamic generalized association rules
Luca Cagliero
Time Series Classification

Time Series
  Sequence of real values
Multivariate Time Series
  Each data item is a set of <attribute: time series> pairs
  Data arising in broad areas (e.g., medicine, finance, multimedia, etc.)
Development of algorithms for
  Selection of the most discriminant attributes
  Classification of new data
Tania Cerquitelli
Database systems
Distributed databases
Challenge
  Scalability and reliability of web applications delivering
    social network interactions
    check-ins to physical (real) places
    sharing of complex data (likes, comments, photos, and videos)
  Examples: Facebook, Twitter, and Foursquare grew by 1000% in a short time
Solution
  Horizontal scalability
    you can’t add more resources to a single main DB
    add new “small” DBs in a network of distributed DBs
  Document-based DBs
    exploit the friendly approach of non-relational DBs
    easy built-in replication management and high performance
Study the potential of distributed DBs and non-relational DBs
References: mongodb.org, http://goo.gl/6L2yC
Daniele Apiletti
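
The idea of horizontal scalability (spreading data over several "small" DBs instead of growing one main DB) can be sketched with hash-based placement of keys. This is a generic illustration, not the mechanism of any specific product mentioned on the slide; the shard names and the key format are hypothetical.

```python
import hashlib

def shard_for(key, shards):
    """Map a key to one shard with a hash that is stable across runs and machines."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

# Hypothetical shard names and user keys.
shards = ["db0", "db1", "db2"]
placement = {f"user:{i}": shard_for(f"user:{i}", shards) for i in range(1000)}
```

Plain modulo placement has a known drawback: adding a shard changes the target of most keys, so existing data must be moved. Schemes such as consistent hashing or range-based partitioning are commonly used to limit that movement.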
A tool for database conceptual model design

Relational databases are designed by means of the Entity-Relationship (ER) model
Few graphical tools are currently available for conceptual model design by means of ER models
  GNU Ferret (http://www.gnuferret.org/) provides a limited set of functionalities
Design and implementation of a new tool for ER conceptual model design
Silvia Chiusano
Text Mining
Summarization

Summarization of documents
  identification of relevant knowledge from news, research articles, blogs
  clustering sentences with a similar and interesting content
  biological knowledge extraction from different texts
  validation of experimental results according to the domain application
Applications
  development of new summarization approaches according to the information of interest
  enhance data representation to speed up the summarization process
  results presentation oriented to user queries
  integration of topic detection algorithms
Information retrieval, text mining, summarization, clustering
Alessandro Fiori
Ontology inference

Ontology:
  a rigorous and exhaustive organization of some knowledge domain
  hierarchical structure
  represents relevant entities and their relationships
Text mining for ontology inference
  identify concepts by means of entity recognition approaches
  extraction of relationships between entities
Examples: DBPedia, YAGO
Applications:
  discovering relationships among domain entities from news, research articles, blogs, etc.
  validate relationships of general purpose ontologies
Entity recognition, association rules, text mining
Luca Cagliero, Alessandro Fiori, Alberto Grand
Social networks

Infer knowledge from user-generated content
  extraction of relevant information from social network sites
  personalization of web crawlers using user profiles
  identification of news, locations, etc.
  user behavior analysis by means of association rule mining
Applications
  summarization approaches to identify relevant information
  classification of web objects using user-generated content
  clustering web pages according to the topic
  development of recommendation systems using user behavior on social networks
Entity recognition, clustering, association rules, text mining
Luca Cagliero, Alessandro Fiori
Mining in Specific Application Domains
Queries on sensor networks

“The sensor network is a database”
Querying the network
  acquiring (possibly aggregate) measurements describing the state of the monitored environment
Challenge: data mining techniques to learn correlated attributes
  which sensors/measurements are correlated?
  how strong is the correlation? (generally sensor data are highly correlated)
  when are sensors/measurements correlated? (e.g., from 8:00 a.m. to 11:00 a.m.)
[Figure labels: App; Query, Trigger; Data; TinyDB; Sensor network]
Tania Cerquitelli
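
The three questions on this slide (which sensors, how strongly, and when) can be made concrete with a small sketch that computes the Pearson correlation between two sensors, restricted to a time window. Pearson correlation is one possible measure, picked here as an assumption; the sensor names, the readings, and the data layout are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired readings")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant series; treated as 0 in this sketch
    return cov / sqrt(var_x * var_y)

def windowed_correlation(readings, sensor_a, sensor_b, start_hour, end_hour):
    """readings: list of (hour, {sensor: value}); keeps hours in [start_hour, end_hour)."""
    rows = [values for hour, values in readings if start_hour <= hour < end_hour]
    return pearson([r[sensor_a] for r in rows], [r[sensor_b] for r in rows])

# Hypothetical temperature readings from two sensors.
readings = [
    (8, {"t1": 20.0, "t2": 21.0}),
    (9, {"t1": 22.0, "t2": 23.0}),
    (10, {"t1": 24.0, "t2": 25.0}),
    (14, {"t1": 30.0, "t2": 10.0}),
]
morning = windowed_correlation(readings, "t1", "t2", 8, 11)
```

Restricting the window to 8:00-11:00 leaves out the 14:00 reading, where the two sensors disagree, which illustrates why the "when" question matters.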
Wireless network traffic analysis

Security
  Characterize traffic profile and detect Internet security threats
Wireless network design
  Wireless network resource allocation
Wireless network traffic analysis by means of data mining algorithms
  Association rules
  Clustering algorithms
Tania Cerquitelli
Medical data analysis

Analysis of patients’ exam log databases containing patients’ historical data
Aims
  extraction of the most frequent sequences
  extraction of medical patterns for specific diseases
  exploiting a compact representation of sets of sequences to allow easier validation
Thesis: implementation of algorithms to extract frequent sequences, with particular attention to temporal constraints and exam ontologies
Giulia Bruno
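
What a temporal constraint on a sequence means can be sketched on the simplest case: a two-exam pattern, counted only when the second exam follows the first within a maximum number of days. This is a deliberately reduced illustration, not a full sequential pattern mining algorithm; the exam names, the log layout, and the gap values are hypothetical.

```python
def contains_pattern(log, first, second, max_gap_days):
    """True if `first` is followed by `second` within max_gap_days in one patient's log."""
    first_days = [day for day, exam in log if exam == first]
    second_days = [day for day, exam in log if exam == second]
    return any(0 < d2 - d1 <= max_gap_days for d1 in first_days for d2 in second_days)

def pattern_support(logs, first, second, max_gap_days):
    """Fraction of patients whose log contains the time-constrained pattern."""
    hits = sum(contains_pattern(log, first, second, max_gap_days) for log in logs.values())
    return hits / len(logs)

# Hypothetical exam logs: patient id -> list of (day, exam).
logs = {
    "p1": [(0, "blood test"), (10, "ecg")],
    "p2": [(0, "blood test"), (90, "ecg")],
    "p3": [(5, "ecg"), (7, "blood test")],
    "p4": [(1, "blood test"), (3, "ecg"), (200, "ecg")],
}
```

With a 30-day gap the pattern "blood test, then ecg" holds for two of the four patients; relaxing the gap to 100 days adds a third, so the temporal constraint directly changes which sequences count as frequent.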
Gene clustering validation

By analysing gene expression data it is possible to cluster genes based on their behaviour in different experimental conditions
The validation of results is critical for two reasons
  lack of benchmark datasets
  choice of the right quality index
Thesis: development of clustering algorithms and evaluation of quality indices to analyze gene expression data
Giulia Bruno, Alessandro Fiori
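
As an example of a quality index that needs no benchmark labels, the sketch below computes the mean silhouette coefficient of a clustering. The silhouette is only one of many internal indices and is used here as an assumption, since the slide does not name any; the points stand in for gene expression profiles and are made up.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical two-condition expression profiles for four genes.
profiles = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(profiles, [0, 0, 1, 1])
bad = silhouette(profiles, [0, 1, 0, 1])
```

The index ranges from -1 to 1: the grouping that keeps nearby profiles together scores close to 1, while the grouping that splits them scores below 0. Different indices can rank the same clusterings differently, which is the difficulty the slide points at.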
Biological and clinical data integration

In the personalized medicine field it is important to integrate heterogeneous medical data (clinical and molecular)
  heterogeneous data management
  detection of correlations among experiments
Thesis: study and modeling of a database/data warehouse to integrate clinical and molecular data, evaluation of real systems (caBIG), study of physical structures for performance improvement, graphical interfaces for data access
Giulia Bruno, Alessandro Fiori
Sports data analysis
Physiological data analysis
• Assessing athletes' improvement
• Assessing blood lactate concentration
• Improving the effectiveness of training sessions
Knowledge discovery
• Profile definition for each athlete (e.g., training heart rate)
• Classification of athletes
Tania Cerquitelli
News analysis
Studies
  Query expansion techniques to reduce the query/document mismatch
  Collaborative filtering, based on the idea that groups of similar users share similar contents
  Content-based filtering, based on the idea that groups of similar contents are shared by the same user
  Hybrid filtering, based on the combination of the two previous approaches
  New story detection: discovering new stories in a flow of news (breaking news)
  Topic detection and linking: discovering news on the same topic in a flow of news, and relationships among news
  Topic tracking: discovering future news related to events of interest to the user
  Automatic highlight detection in the context of sport events
Alessandro Fiori
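
The collaborative filtering idea listed above (similar users share similar contents) can be sketched with a user-based recommender that weights each unread article by the Jaccard similarity between the target user and the users who read it. The similarity measure, the user names, and the article identifiers are assumptions made for illustration.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, read_by_user, top_n=3):
    """Score unseen articles by the similarity of the users who read them."""
    seen = read_by_user[target]
    scores = {}
    for user, articles in read_by_user.items():
        if user == target:
            continue
        weight = jaccard(seen, articles)
        for article in articles - seen:
            scores[article] = scores.get(article, 0.0) + weight
    # Highest score first; ties broken by article id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, score in ranked[:top_n] if score > 0]

# Hypothetical reading histories.
read_by_user = {
    "ann": {"n1", "n2", "n3"},
    "bob": {"n1", "n2", "n4"},
    "cat": {"n5"},
}
suggestions = recommend("ann", read_by_user)
```

Article n4 is suggested to ann because bob, who shares two articles with her, read it; n5 is not, because cat has nothing in common with her. A content-based filter would instead compare the articles themselves, and a hybrid filter would combine both scores.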
External stage / internship
www.ooros.com

Web and mobile apps with social network interactions (Facebook, Twitter, Foursquare, LinkedIn, ...)
  -> Data mining techniques to analyze user interactions (both basic, i.e., “like” and “comment”, and on games, contests, etc.)
Web and mobile apps exploiting the user geo-location (e.g., Facebook Places, Foursquare, and Gowalla check-in)
  -> spatial data analysis (e.g., “my nearest friend”)
  -> spatial database indexes
Mobile apps (Android, iPhone, etc.) with offline replication
  -> handling flaky mobile connections by means of a local DB which syncs with remote DBs
Elena Baralis, Daniele Apiletti
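
The "my nearest friend" example can be sketched as a great-circle distance computation followed by a linear scan. The haversine formula and the brute-force search are illustrative choices; the coordinates and friend names are made up, and no spatial index is used here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def nearest_friend(me, friends):
    """friends: {name: (lat, lon)}; linear scan over all friends."""
    return min(friends, key=lambda name: haversine_km(me, friends[name]))

# Hypothetical positions (approximate city coordinates).
me = (45.07, 7.69)
friends = {"friend_in_milan": (45.46, 9.19), "friend_in_rome": (41.90, 12.50)}
closest = nearest_friend(me, friends)
```

The scan costs one distance computation per friend for every query, which is where the spatial database indexes mentioned on the slide come in: they prune most candidates without computing their distance.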