Special Topics in Database Systems
Martin Ester
Simon Fraser University
School of Computing Science
CMPT 884
Spring 2009
CMPT 884, SFU, Martin Ester, 1-09
Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic
extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.
Remarks
• (semi-)automatic: distinction from manual analysis / OLAP.
Typically, some user interaction is necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.
Introduction
Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data
Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data
Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases
Introduction
KDD Process [Han & Kamber 2000]
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection →
Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Database → Focussing → Preprocessing → Transformation → Data Mining →
Pattern → Evaluation → Knowledge
Data Mining
Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the
patterns contained in some database.
Data-Mining Tasks
• clustering
• classification
• association rules, e.g. A and B → C
• generalisation
other tasks: regression, outlier detection . . .
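To make the association rule example A and B → C concrete, the following minimal sketch computes the two usual rule measures, support and confidence, over a handful of transactions. The transactions and the function names are hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

# Rule "A and B -> C": 2 of 5 transactions contain A, B and C,
# and 2 of the 3 transactions containing A and B also contain C.
rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

An association rule miner would typically report only rules whose support and confidence exceed user-given thresholds.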
Trends in KDD Research
KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process
Trends in KDD Research
KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications
Trends in KDD Research
KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, . . .)
Trends in KDD Research
KDD 2006 Conference
• Clustering
• Classification / supervised ML
• Privacy
• Web / Graph Mining
• Web / Text Mining
• Frequent Pattern Mining
• Structured Data
Trends in KDD Research
KDD 2008 Conference
• Text Mining
• Data Integration
• Social Networks
• Graph Mining
• Distance Functions and Metric Learning
• Active and Semi-supervised Learning
• Pattern Mining
• Collaborative Filtering
Trends in KDD Research
Some Hot Topics
• Social Networks
THE hot topic of KDD 08
→ topic of the only panel
• Graph mining
• Text mining
and information extraction / integration
• Collaborative Filtering
more generally, recommender systems
→ $1M Netflix Prize
Overview of this Course
Prerequisites
Foundations of database systems and statistics
Introductory graduate data mining course or equivalent
Objectives
• Introduction to some hot topics of data mining research
• Training in research methodology
• Presentation skills
start thesis work after this class!
Overview of this Course
Topics
• Graph mining
social network analysis and analysis of biological networks
as driving applications
• Recommender systems
in particular trust-based recommendation
• Information extraction and integration
integration with existing databases
Overview of this Course
Format
• Tutorial surveys
by instructor
• Written research paper reviews
by students
• Research paper presentations
by students
discussions in class
• Course research projects
by students
on a topic of their choice
Overview of this Course
Tentative Grading Scheme
• Paper review (20%)
• Paper presentation (20%)
• Course project report (40%)
two steps:
project proposal, final project report
• Course project presentation (20%)
→ marking criteria:
originality, technical quality, presentation
Overview of this Course
Types of Course Projects
• Literature survey
summarize the state-of-the-art and identify open research problems
• New problem
introduce and analyze a new problem
• New algorithm for known problem
implement and evaluate algorithm
• Improvement of existing algorithm
implement and compare algorithm
• Comparison of existing algorithms on a new, interesting dataset
identify criteria for choice of algorithms / open research problems
Graph Mining
Motivating Applications
• Social network analysis
o What communities exist?
o How does information about a new product spread?
o What customers should be targeted to maximize the profit of a marketing
campaign?
• Analysis of biological networks
o What are the functional modules of an organism?
o How do biological networks evolve in the course of time?
o What protein should be targeted to inhibit some virulent bacteria?
Graph Mining
Methods
• Frequent subgraph mining
frequent pattern mining approach
• Graph clustering
e.g., normalized cut, i.e., minimize the (normalized) number of edges between
graph components / clusters
• Graph generative models
probabilistic models that generate graphs similar to
real graphs / networks
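As an illustration of the graph clustering criterion, the sketch below evaluates the normalized cut of a two-way partition of an undirected, unweighted graph, using the common definition Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where vol is the sum of node degrees. The toy graph and the function name are assumptions made for this example; the sketch only scores a given partition and does not search for the best one.

```python
def normalized_cut(edges, part_a, part_b):
    """Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B).

    `edges` is a list of undirected, unweighted edges (u, v).
    Assumes both parts contain at least one node with an edge.
    """
    part_a, part_b = set(part_a), set(part_b)
    degree = {}
    cut = 0
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        # Count edges with one endpoint in each part.
        if (u in part_a and v in part_b) or (u in part_b and v in part_a):
            cut += 1
    vol_a = sum(degree.get(n, 0) for n in part_a)
    vol_b = sum(degree.get(n, 0) for n in part_b)
    return cut / vol_a + cut / vol_b


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

balanced = normalized_cut(edges, {1, 2, 3}, {4, 5, 6})   # cuts one edge
lopsided = normalized_cut(edges, {1}, {2, 3, 4, 5, 6})   # isolates node 1
print(balanced, lopsided)
```

The normalization by volume is what penalizes the lopsided partition: cutting off a single node yields a much larger value than separating the two triangles.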
Graph Mining
Challenges
• Complexity of graph algorithms
o Many graph mining problems are NP-hard.
o Real graphs tend to be extremely large.
→ need efficient algorithms
• Attribute data
o Many graphs have attributes associated with the nodes.
o Transformation into a weighted graph loses a lot of information.
→ need new models / algorithms considering relationship and attribute data
Recommender Systems
Motivating Applications
• Motivation
o The internet provides a flood of information on all kinds of items.
o There is a great need for personalized recommendations.
o The internet also provides a wealth of item ratings / reviews.
• Typical applications
o Movie recommendation
o Product recommendation
o Keyword recommendation
Recommender Systems
Methods
• Collaborative filtering
o Uses only a database of user – item ratings.
o Recommendation based on ratings by users with similar rating patterns.
• Content-based recommender systems
o Uses information about the content of items and / or the properties of users.
o Recommends items that have content similar to items liked by user.
• Trust-based recommender systems
o Assume a social network / trust network. Trust can be defined explicitly or
implicitly.
o Recommendation based on ratings by trusted neighbors.
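The difference between collaborative filtering and trust-based recommendation can be sketched with one shared prediction function: both predict a rating as a weighted average of other users' ratings, and differ only in where the weights come from. The ratings, the trust statement, and all names below are hypothetical; cosine similarity is just one of several possible similarity measures.

```python
from math import sqrt


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(ratings, weights, item):
    """Weighted average of the ratings given to `item` by the weighted users."""
    num = den = 0.0
    for other, w in weights.items():
        if w > 0 and item in ratings.get(other, {}):
            num += w * ratings[other][item]
            den += w
    return num / den if den else None


# Hypothetical user-item ratings on a 1-5 scale; alice has not rated m3.
ratings = {
    "alice": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}

# Collaborative filtering: weight other users by rating similarity to alice.
cf_weights = {
    user: cosine_similarity(ratings["alice"], r)
    for user, r in ratings.items()
    if user != "alice"
}
cf_prediction = predict(ratings, cf_weights, "m3")

# Trust-based: weight users by explicit trust statements (assumed data).
trust = {"alice": {"carol": 1.0}}
trust_prediction = predict(ratings, trust["alice"], "m3")
print(cf_prediction, trust_prediction)
```

Here the collaborative prediction is pulled towards bob, whose rating pattern matches alice's, while the trust-based prediction follows carol, the only user alice explicitly trusts.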
Recommender Systems
Challenges
• High dimensionality and sparsity of data
o The overwhelming majority (> 99%) of user-item ratings are unknown.
o Recommendation is especially hard for cold-start users and controversial items.
→ dimensionality reduction, model-based methods, trust-based approach
• Fraud
o Memory-based collaborative filtering can be easily manipulated by adding
fraudulent ratings.
→ trust-based approach more robust to fraud
• Privacy issues with trust network data
o Only very few trust networks are publicly available.
Information Extraction and Integration
Motivating Applications
• Importance of unstructured text data
o The overwhelming majority (≥ 80%) of human-generated information
is not in structured form, but in unstructured text.
• Biomedical literature
o Contains a wealth of valuable information that cannot be processed / searched
automatically.
o Extraction of entities and relationships such as proteins and their localizations.
• Online product reviews
o A lot of product "reviews" available online in community databases or blogs.
o Companies want to know what customers think of their products.
Information Extraction and Integration
Methods
• Basic NLP methods
o Part-of-speech tagging
o Lexica, ontologies, . . .
• Machine learning methods
o Typically, supervised classification.
o CRFs and similar methods are state-of-the-art.
• Bootstrapping approach
o Using a small labeled training dataset, find textual extraction patterns.
o Using these patterns, extract further entities / relationships and continue.
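The bootstrapping loop can be sketched as follows: starting from a few known (protein, localization) pairs, learn the text that connects them, use that text as an extraction pattern to find new pairs, and repeat. The sentences, seed pair, and function names are hypothetical, and the pattern representation (the literal string between the two mentions) is a deliberate simplification; a practical system would also score patterns to avoid drifting to wrong extractions.

```python
import re


def learn_patterns(sentences, pairs):
    """Collect the text found between the two members of each known pair."""
    patterns = set()
    for sentence in sentences:
        for x, y in pairs:
            match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences, patterns):
    """Extract (word before, word after) for each occurrence of a pattern."""
    found = set()
    for sentence in sentences:
        for pattern in patterns:
            regex = r"(\w+)" + re.escape(pattern) + r"(\w+)"
            for match in re.finditer(regex, sentence):
                found.add((match.group(1), match.group(2)))
    return found


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns and extracting new pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(sentences, pairs)
        new_pairs = apply_patterns(sentences, patterns) - pairs
        if not new_pairs:
            break
        pairs |= new_pairs
    return pairs


# Hypothetical corpus and a single labeled seed pair.
sentences = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the membrane.",
    "ProtB resides in the membrane.",
    "ProtC resides in the cytoplasm.",
]
seeds = {("ProtA", "nucleus")}
extracted = bootstrap(sentences, seeds)
print(sorted(extracted))
```

In the first round only the pattern " is localized in the " is known; the pair it extracts for ProtB then reveals the second pattern " resides in the ", which in turn yields the pair for ProtC.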
Information Extraction and Integration
Challenges
• Text data is hard to understand
o Many of the NLP problems are still essentially unsolved.
→ relatively simple NLP methods often sufficient for information extraction
• Portability across domains
o Extraction methods need to be portable from one domain to another.
o Knowledge engineering approach (domain expert defines rules) is
labor-intensive and expensive.
→ machine learning methods
• Entity mentions need to be resolved
o Information extraction produces strings referencing an entity of a given type.
o Without mapping to known real world entities, extracted information is of
limited usefulness.
→ need to integrate extracted information with existing databases
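A minimal sketch of resolving extracted entity mentions against an existing database is shown below. It normalizes each mention and picks the known entity whose name is most similar according to `difflib.SequenceMatcher` from the Python standard library. The entity identifiers, the similarity threshold, and the function names are assumptions for illustration; real entity resolution usually also exploits context and entity types.

```python
from difflib import SequenceMatcher


def normalize(mention):
    """Lower-case the mention and drop everything but letters and digits."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def resolve(mention, known_entities, threshold=0.8):
    """Return the id of the best-matching known entity, or None.

    `known_entities` maps an entity id to a list of known names.
    """
    best_id, best_score = None, 0.0
    target = normalize(mention)
    for entity_id, names in known_entities.items():
        for name in names:
            score = SequenceMatcher(None, target, normalize(name)).ratio()
            if score > best_score:
                best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


# Hypothetical database of known entities with their name variants.
known_entities = {
    "P001": ["tumor protein p53", "TP53"],
    "P002": ["insulin"],
}

full_name_match = resolve("Tumor-protein P53", known_entities)
symbol_match = resolve("TP-53", known_entities)
no_match = resolve("hemoglobin", known_entities)
print(full_name_match, symbol_match, no_match)
```

Mentions that stay below the threshold are left unresolved rather than forced onto the closest entity, which keeps wrong links out of the integrated database at the cost of some missed matches.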
References
Graph mining
- X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial
KDD 08
- Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models,
Diffusion and Case Studies", Tutorial ECML/PKDD 2007
Recommender systems
- Joseph Konstan, "Introduction to Recommender Systems", Tutorial
SIGMOD 2008
Information extraction and integration
- Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and
Integration", Tutorial KDD 06
- AnHai Doan & Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing
Information Extraction", Tutorial SIGMOD 2006