P-N-RMiner: A Generic
Framework for Mining Interesting
Structured Relational Patterns
Jefrey Lijffijt, Eirini Spyropoulou, Bo Kang, Tijl De Bie
University of Bristol
[JL, BK, TDB ➞ Ghent University, ES ➞ Barclays]
Motivation
Example
Suppose we have some data,
[Schema diagram: User linked to Profession, Age, and Check-in times]
and we are interested in lifestyle patterns.
Example
What will we find?
For example: “many people aged 30–45 check in
somewhere both between 7.30 and 8.30 in the
morning and between 11.30 and 12.30 around noon”
Could be interesting if unexpected
Example
Complicated data
Example
Complicated data
It is relational
Multiple professions, many check-in times
Cannot be flattened without loss of information
Example
Complicated data
It has structured attributes
Time of day, age, professions (hierarchy or DAG)
No pattern mining framework deals with all these
Example
Complicated data
Typical solution: choose a granularity, discretise
Different patterns may need different granularity
Worse, some patterns require mixed granularity
Example
Complicated data
[Schema diagram: User linked to Profession, Age, and Check-in times]
How to avoid discretisation?
These structures are similar
Can all be modelled as partial orders
[Figure 1. Example schema of users and check-in times. Additionally, we know the age and profession of the users. There are three relationship types: there are relationships between (1) users and check-in times, (2) users and ages, and (3) users and professions.]
[Figure 2. Hasse diagram of the partial order of intervals (e.g. [1–2], [2–3], [1–4]); the partial order corresponds to the superset relation.]
From the paper: “We formalise the problem and a matching pattern syntax, in a manner as generic as possible (Sec. II). To achieve this, we adopt an abstract formalisation in terms of a partial order over the structured values. For example, with the time-of-day and book ratings, the partial order is over the intervals, where one is ‘smaller’ than another if it is …”
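To make the partial-order idea concrete, here is a minimal Python sketch; the `(lo, hi)` tuple representation and the function name are assumptions for illustration, not the paper's data structures:

```python
# Toy sketch: interval values as (lo, hi) pairs over the domain {1, ..., 4},
# partially ordered by the superset relation (as in Figure 2). No single
# discretisation granularity is fixed; all intervals coexist in the order.
def contains(a, b):
    """True iff interval a is a superset of interval b."""
    return a[0] <= b[0] and b[1] <= a[1]

print(contains((1, 4), (2, 3)))  # True: [1-4] is a superset of [2-3]
print(contains((1, 2), (2, 3)))  # False: overlapping but incomparable
```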
Pattern mining
The goal
Find interesting and surprising patterns in
massive data
The reality
Find massively many patterns in any data
Pattern mining
The goal
Find interesting and surprising patterns in
massive data
The solution
Score patterns for their interestingness
Interestingness
What makes a pattern interesting?
Very many scores (e.g., Geng & Hamilton 2006)
Support, confidence, Piatetsky-Shapiro / lift,
accuracy, cosine, WRAcc, …
Interestingness
What makes a pattern interesting?
Very many scores (e.g., Geng & Hamilton 2006)
Support, confidence, Piatetsky-Shapiro / lift,
accuracy, cosine, WRAcc, chi-square, …
≈ unexpected (frequency of) co-occurrence
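For reference, the classical scores reduce to arithmetic on co-occurrence counts; a quick sketch for a rule A → B, with made-up counts (not from the paper):

```python
# Toy counts for a rule A -> B over n transactions (made-up numbers).
n, n_a, n_b, n_ab = 100, 40, 30, 24

support = n_ab / n               # how often A and B co-occur
confidence = n_ab / n_a          # estimate of P(B | A)
lift = (n_ab * n) / (n_a * n_b)  # observed vs expected co-occurrence
print(support, confidence, lift)  # 0.24 0.6 2.0
```

A lift of 2.0 says A and B co-occur twice as often as expected under independence, which is exactly the "unexpected co-occurrence" reading on the slide.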
Interestingness
What makes a pattern interesting?
≈ unexpected (frequency of) co-occurrence
Some ‘recent’ encompassing frameworks:
Information theory (see De Bie 2011, 2013)
Hypothesis testing (see Lijffijt et al. 2014)
Interestingness
What makes a pattern interesting?
≈ unexpected (frequency of) co-occurrence
Some ‘recent’ encompassing frameworks:
Information theory (this paper)
Hypothesis testing
Interestingness
Information-theoretic
interestingness
What makes a pattern interesting?
≈ unexpected (frequency of) co-occurrence
= Surprisal = self-information = –log Pr(pattern)
Information-theoretic
interestingness
What makes a pattern interesting?
≈ unexpected (frequency of) co-occurrence
= Surprisal = self-information = –log Pr(pattern)
Actually
self-information / description length
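A minimal sketch of this score, assuming Pr(pattern) is already available (in the paper it comes from the background distribution) and using a made-up description length:

```python
import math

# Interestingness as on the slide: self-information -log Pr(pattern),
# normalised by the description length of the pattern.
def interestingness(p_pattern, description_length):
    self_information = -math.log2(p_pattern)  # surprisal, in bits
    return self_information / description_length

# A pattern with probability 1/1024 carries 10 bits of surprisal;
# describing it in 5 units gives a score of 2 bits per unit.
print(interestingness(1 / 1024, 5.0))  # 2.0
```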
One minor problem
Unexpected as compared to …?
Subjective interestingness
Users have different prior beliefs about data/domain
Data mining is an iterative process, users learn as
analysis progresses
Information-theoretic framework (FORSIED) to deal with
this introduced by De Bie (KDD 2011, IDA 2013)
General setting of our research project (ERC Grant
FORSIED): Formalising Subjective Interestingness in
Exploratory Data Mining [we are hiring a PhD student]
Subjective interestingness
Unexpected as compared to …
Maximum entropy distribution that satisfies the prior beliefs
Subjective interestingness
Unexpected as compared to …
Maximum entropy distribution that satisfies the prior beliefs
Prior beliefs in the form of expectations
Row/column marginals / degree of nodes
First- and second-order moments for numerical variables
The following edges are present: …
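A generic sketch of such a maximum entropy model, assuming binary relationship indicators with expected row/column sums (marginals) as constraints. Under these constraints the maxent distribution factorises into independent Bernoullis with Pr(edge ij) = sigmoid(lam[i] + mu[j]); the plain gradient-ascent fit below is an illustration, not the paper's solver:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_maxent(row_sums, col_sums, iters=500, lr=0.1):
    """Fit lam, mu so that expected row/column sums match the prior beliefs."""
    n, m = len(row_sums), len(col_sums)
    lam, mu = [0.0] * n, [0.0] * m
    for _ in range(iters):
        for i in range(n):
            expected = sum(sigmoid(lam[i] + mu[j]) for j in range(m))
            lam[i] += lr * (row_sums[i] - expected)
        for j in range(m):
            expected = sum(sigmoid(lam[i] + mu[j]) for i in range(n))
            mu[j] += lr * (col_sums[j] - expected)
    return [[sigmoid(lam[i] + mu[j]) for j in range(m)] for i in range(n)]

# Two users with degrees 2 and 1, three entities with degree 1 each.
P = fit_maxent([2, 1], [1, 1, 1])
print(round(sum(P[0]), 2), round(sum(P[1]), 2))  # expected sums near 2.0 1.0
```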
Mining patterns
Relational Pattern Mining:
RMiner (Spyropoulou et al. 2014)
Relational: patterns across data tables with arbitrarily many links
(many-to-many etc.)
Database = entities + relationship instances
Every user, age, profession, time is an entity
There is a relationship instance between a user and an age if that
user has that age
Relational Pattern Mining:
RMiner (Spyropoulou et al. 2014)
Relational: patterns across data tables with arbitrarily many links
(many-to-many etc.)
Essentially a graph: nodes = entities, edges = relationship instances
Database = entities + relationship instances
Every user, age, profession, time is an entity
There is a relationship instance between a user and an age if that user has that age
Relational Pattern Mining:
RMiner (Spyropoulou et al. 2014)
Database = entities + relationship instances
RMiner
Find all potentially interesting patterns
= all completely connected sets of entities
Rank the patterns for interestingness
Relational Pattern Mining:
RMiner (Spyropoulou et al. 2014)
Database = entities + relationship instances
P-N-RMiner
Find all potentially interesting patterns
= all completely connected sets of entities
Rank the patterns for interestingness
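A toy sketch of the "completely connected set" condition: every pair of entities whose types are related in the schema must be linked by a relationship instance. The mini-database and entity names below are made up; this is not RMiner's actual code:

```python
from itertools import combinations

# Made-up mini-database: entity -> type, schema edges between entity types,
# and relationship instances between concrete entities.
entity_type = {"u1": "User", "age30": "Age", "nurse": "Profession"}
schema = {frozenset({"User", "Age"}), frozenset({"User", "Profession"})}
instances = {frozenset({"u1", "age30"}), frozenset({"u1", "nurse"})}

def completely_connected(entities):
    """True iff every schema-related pair in the set is actually linked."""
    for x, y in combinations(entities, 2):
        if frozenset({entity_type[x], entity_type[y]}) in schema \
                and frozenset({x, y}) not in instances:
            return False
    return True

print(completely_connected({"u1", "age30", "nurse"}))  # True
```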
Outline of algorithm
Instantiation of fix-point enumeration (Boley et al. 2010)
Works for any strongly accessible set system:
All feasible generalisations of a set can be constructed by adding one element at a time
For every feasible set except the empty set, there is an element whose removal again yields a feasible set
Proof that our set system is strongly accessible is in the paper
The closure operator is problem-specific and determines efficiency (optimal here?)
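A minimal sketch of fix-point (closed-set) enumeration in the divide-and-conquer style of Boley et al. (2010), using a toy closure operator (intersection of all covering transactions) rather than P-N-RMiner's specialised closure; the dataset is made up:

```python
# Toy data: three transactions over items {a, b, c}.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
items = {"a", "b", "c"}

def closure(s):
    """Closure of s: all items shared by every transaction containing s."""
    covering = [t for t in transactions if s <= t]
    return set.intersection(*covering) if covering else set(items)

def enumerate_closed(current, banned, out):
    out.append(frozenset(current))
    for e in sorted(items - current - banned):
        new = closure(current | {e})
        if new & banned:  # would re-enter a branch enumerated earlier
            continue
        enumerate_closed(new, set(banned), out)
        banned.add(e)

results = []
enumerate_closed(closure(set()), set(), results)
print(len(results))  # 4 closed sets: {b}, {a,b}, {b,c}, {a,b,c}
```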
Ranking
Information-theoretic
interestingness
What makes a pattern interesting?
≈ unexpected (frequency of) co-occurrence
= Surprisal = self-information = –log Pr(pattern)
Actually
self-information / description length
Interestingness
For the experiments, we used marginals as prior beliefs
The user knows the ‘frequency’ of entities, but
not of the relations between entities
Background distribution is the maximum entropy
distribution given the prior beliefs as constraints
Interestingness
It is straightforward to include other knowledge
Users may want to input their own ‘beliefs’
This is why interestingness is subjective
(Beliefs need not even be correct)
Or incorporate a pattern after reading it
Iterative data mining / pattern set mining
Case studies
Foursquare check-ins
Cheng et al. (ICWSM 2011)
[Schema diagram: User, Check-In, and Circadian (time-of-day) entities]
P1: 1.6% of the users checked in frequently
between [6am–7am], as well as [10.20am–10.50am]
P4: 4.5% of the users checked in frequently
between [1.10am–2.30am], [4.30pm–6.30pm], as
well as [8.30pm–9.30pm]
https://snap.stanford.edu/data/amazon-meta.html
Amazon book ratings
[Schema diagram: Customer, Book, Rating, and Subject entities]
P1: 23 customers and 8 books (all different versions of “Left Behind: A Novel of the Earth’s Last Days”), a rating [1–5], and the subjects Fiction and Christianity.
Amazon copies reviews between similar items
Subgroup discovery
[Scatter-plot matrices of the Iris data: Sepal length, Sepal width, Petal length, Petal width]
How: take class label as mandatorily present
Subspace clustering
[Scatter-plot matrices of the Iris data: Sepal length, Sepal width, Petal length, Petal width]
How: natural setting for P-N-RMiner
Summary
P-N-RMiner is a general solution for mining
interesting patterns in data, supporting
Relational data
Structured attributes
Subjective prior beliefs
Iterative data mining
Thank you!
Scalability
There is polynomial-time delay between two closure
steps (Spyropoulou et al. 2014)
We can capitalise on the partial order structures (this
paper)
That actually gives a noticeable speed-up (this paper)
However, there is no polynomial-time delay between two outputs
(the proof will appear in follow-up work)
Hence, this algorithm cannot be applied to very large data