
RoloDex Model
The Data Cube Model gives a great picture of relationships, but can become
gigantic (instances are bitmapped rather than listed, so there needs to be a position
for each potential instance, not just each extant instance).
The inefficiency described above is especially severe in the very common Bipartite, Unipartite-on-Part (BUP) relationships.
Examples:
In Bioinformatics, bipartite relationships between genes (one entity) and
experiments or treatments (another entity) are studied in conjunction with
unipartite relationships on the gene part (e.g., gene-gene or protein-protein
interactions).
In Market Research, bipartite relationships between items and customers are
studied in conjunction with unipartite relationships on the customer part (or on
the product part, or both).
For this situation, the Relational Model provides no picture and the Data Cube
Model is too inefficient (requires that the unipartite relationship be redundantly
replicated for every instance of the other bi-part). We suggest the RoloDex Model.
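The storage argument above can be illustrated with a small sketch. The `Card` class, the toy axis sizes, and the edge lists are hypothetical, chosen only to compare one bitmapped cube against separately bitmapped cards; this is not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Card:
    """One relationship (a RoloDex card) between two entity axes."""
    axes: tuple                               # names of the two axes
    edges: set = field(default_factory=set)   # extant (row, column) instances

# Hypothetical toy axes: 3 experiments and 4 genes.
exps = range(1, 4)
genes = range(1, 5)

# A RoloDex is a collection of cards that share axes.
rolodex = {
    "expgene": Card(("Exp", "Gene"), {(1, 1), (1, 3), (2, 2), (3, 4)}),
    "genegene": Card(("Gene", "Gene"), {(1, 2), (2, 3), (3, 4)}),
}

# A bitmapped cube over (Exp, Gene, Gene) needs a position for every
# potential cell, so the gene-gene relationship is repeated per experiment.
cube_cells = len(exps) * len(genes) * len(genes)

# Bitmapping each card separately needs one position per potential pair.
card_cells = len(exps) * len(genes) + len(genes) * len(genes)

print(cube_cells, card_cells)
```

With these toy sizes the cube needs 48 positions and the two cards need 28; the gap grows with the size of the axis that the unipartite card does not involve.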
The Bipartite, Unipartite-on-Part Experiment Gene Relationship, EGG
So as not to duplicate axes, this copy of G should be folded over to coincide with the other copy, producing a "conical" unipartite card.
[Figure: gene-gene card drawn on two copies of the G axis (positions 1 to 4).]
[Figure: the RoloDex Model. Relationship cards share entity axes (People (Author, Customer), Doc, Term, Gene, Exp, PI, Item, ItemSet). Cards shown: termdoc, docdoc, termterm (share stem?), authordoc, expgene, genegene (ppi), expPI, custitem, itemsetitemset (antecedent).]
Most interestingness measures are based on one of these supports.
In IR, df(t) = suppG({t}, tc(t,d)); tf(t,d) is the one histogram bar in suppMG({t}, tc(t,d)).
Supp(A) = CusFreq(ItemSet); Conf(A→B) = Supp(A∪B)/Supp(A).
For every Axis-Card pair (Entity-Relationship pair), ac(a,b), there is a
support count for AxisSets (or ratio or %): for every A ⊆ a, for a graph
relationship, suppG(A, ac(a,b)) = |{b : ∃a∈A, (a,b)∈c}|,
and for a multigraph, suppMG is the histogram over b of
(a,b)-EdgeCounts, a∈A. Other quantifiers can be used
also (e.g., the universal, ∀, is used in MBR).
In MBR, supp(I) = suppG(I, ic(i,t)).
In MDA, suppMG(GSet, gc(g,e)).
Of course all supports are inherited redundantly by the card, c(a,b).
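A minimal sketch of the two supports defined above, assuming the graph support uses the existential quantifier and that a multigraph card is stored as a dictionary of edge counts. The function names and the toy term-document counts are hypothetical.

```python
from collections import Counter

def supp_g(A, edges):
    """Graph support: number of b having at least one a in A with (a, b) an edge."""
    return len({b for (a, b) in edges if a in A})

def supp_mg(A, edge_counts):
    """Multigraph support: histogram over b of (a, b) edge counts, for a in A."""
    hist = Counter()
    for (a, b), n in edge_counts.items():
        if a in A:
            hist[b] += n
    return hist

# Hypothetical term-document card tc(t, d): (term, doc) -> occurrence count.
tc_counts = {("gene", "d1"): 3, ("gene", "d2"): 1, ("card", "d1"): 2}

df_gene = supp_g({"gene"}, tc_counts.keys())   # document frequency of "gene"
tf_gene = supp_mg({"gene"}, tc_counts)         # one histogram bar per document

print(df_gene, dict(tf_gene))
```

Here `df_gene` plays the role of df(t), and each bar `tf_gene[d]` plays the role of tf(t,d).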
Cousin Association Rule Mining Approach (CARMA)
For any card (RELATIONSHIP) c(I,T) one has
Association Rules among disjoint Isets, A→C, A,C ⊆ I, with A∩C=∅, and
Association Rules among disjoint Tsets, A→C, A,C ⊆ T, with A∩C=∅.
Two measures of quality of A→C are:
SUPP(A∪C), where, e.g., for any Iset, A, SUPP(A) ≡ |{ t | (i,t)∈E ∀i∈A}|
CONF(A→C) = SUPP(A∪C)/SUPP(A)
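The SUPP and CONF measures above can be sketched as follows, assuming the universal quantifier (every i in A must be related to t, as in market basket research) and taking E to be the edge set of the card. The item and transaction names are hypothetical toy data.

```python
def supp(A, E, T):
    """SUPP(A): number of t in T such that (i, t) is in E for every i in A."""
    return sum(1 for t in T if all((i, t) in E for i in A))

def conf(A, C, E, T):
    """CONF(A -> C) = SUPP(A u C) / SUPP(A); returns 0.0 when SUPP(A) is 0."""
    s = supp(A, E, T)
    return supp(A | C, E, T) / s if s else 0.0

# Hypothetical item-transaction card.
T = {"t1", "t2", "t3"}
E = {("bread", "t1"), ("bread", "t2"), ("bread", "t3"),
     ("milk", "t1"), ("milk", "t3")}

print(supp({"bread"}, E, T), conf({"bread"}, {"milk"}, E, T))
```

In this toy card, bread occurs in three transactions and bread with milk in two, so the rule bread→milk has confidence 2/3.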
First Cousin Association Rules:
Given any card sharing an axis with the bipartite relationship, B(T,I), e.g., C(T,U),
Cousin Association Rules are those in which the antecedent Tset is generated by a subset,
S, of U as follows: {t∈T | ∃u∈S such that (t,u)∈C} (note this should be called an "existential
first cousin AR" since we are using the existential quantifier. One can also use the universal
quantifier (used in MBR ARs)).
E.g., S ⊆ U, A=C(S), A'⊆T, then A→A' is a CAR and we can also label it S→A'.
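A small sketch of an existential first cousin rule, under the assumption that the rule's support is counted on the I axis of B(T,I) with the universal quantifier, as in the SUPP definition. All names and the toy cards B and C are hypothetical.

```python
def generate_tset(S, C):
    """Existential generation: {t : some u in S has (t, u) in C}."""
    return {t for (t, u) in C if u in S}

def supp(tset, B, I):
    """Number of i in I related through B to every t in the Tset."""
    return sum(1 for i in I if all((t, i) in B for t in tset))

def conf(A, A2, B, I):
    """Confidence of the rule A -> A2 measured on card B."""
    s = supp(A, B, I)
    return supp(A | A2, B, I) / s if s else 0.0

# Hypothetical cousin card C(T, U) and bipartite card B(T, I).
C = {("t1", "u1"), ("t2", "u1"), ("t3", "u2")}
B = {("t1", "i1"), ("t2", "i1"), ("t3", "i1"),
     ("t1", "i2"), ("t2", "i2"), ("t3", "i3")}
I = {"i1", "i2", "i3"}

S = {"u1"}                 # generating set on the cousin axis U
A = generate_tset(S, C)    # antecedent Tset, can also be labeled by S
A2 = {"t3"}                # consequent Tset, disjoint from A

print(sorted(A), supp(A, B, I), conf(A, A2, B, I))
```

Swapping the comprehension in `generate_tset` for a universal test (every u in S related to t) would give the universal variant mentioned in the text.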
First Cousin Association Rules Once Removed (FCAR1Rs) are those in which both Tsets are
generated by another bipartite relationship, and we can label the antecedent and/or the
consequent using the generating set or the Tset.
Second Cousin Association Rules are those in which the antecedent Tset is generated by a
subset of an axis which shares a card with T, which shares the card, B, with I. 2CARs can
be denoted using the generating (second cousin) set or the Tset antecedent.
Second Cousin Association Rules once removed are those in which the antecedent Tset is
generated by a subset of an axis which shares a card with T, which shares the card, B, with I
and the consequent is generated by C(T,U) (a first cousin, Tset) . 2CAR-1rs can be denoted
using any combination of the generating (second cousin) set or the Tset antecedent and the
generating (first cousin) or Tset consequent.
Second Cousin Association Rules twice removed are those in which the antecedent Tset is
generated by a subset of an axis which shares a card with T, which shares the card, B, with I
and the consequent is generated by a subset of an axis which shares a card with T, which
shares another first cousin card with I. 2CAR-2rs can be denoted using any combination of
the generating (second cousin) set or the Tset antecedent and the generating (second cousin)
or Tset consequent. Note 2CAR-2rs are also 2CAR-1rs so they can be denoted as above also.
Third Cousin Association Rules are those....
We note that these definitions give us many opportunities to define quality measures.
Measuring CARMA Quality in the RoloDex Model
[Figure: RoloDex of relationship cards (custitem, termdoc, genegene (ppi), authordoc, expPI, docdoc, expgene, termterm (share stem?)) on shared axes.]
For Distance CARMA relationships, quality (e.g., supp or conf or???) can be measured using information on any/all cards along the relationship (multiple cards can contribute factors or terms or in some other way???).
Generalized CARMA: First, we propose a definition of Generalized Association Rules
(GARs) which contains the standard "1 Entity Itemset" AR definition as a special case.
Association Pathway Mining (APM) is a data mining (DM) technique (with application to bioinformatics?).
Given Relationships R1, R2 (RoloDex cards) with shared Entity E2 (axis), E1 -R1- E2 -R2- E3,
and given A ⊆ E1 and C ⊆ E3, then A→C is a Generalized E2 Association Rule, with
Support_R1R2(A→C) = |{t∈E2 | ∀a∈A, (a,t)∈R1 and ∀c∈C, (c,t)∈R2}|
Confidence_R1R2(A→C) = Support_R1R2(A→C) / Support_R1(A), where as always,
Support_R1(A) = |{t∈E2 | ∀a∈A, (a,t)∈R1}|.
If E3=E1, the GAR is a standard AR iff A∩C=∅.
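The generalized support and confidence can be sketched as below, assuming the universal quantifier over A and C. The function names and the toy term-document (R1) and author-document (R2) cards are hypothetical.

```python
def support_set(A, C, R1, R2, E2):
    """{t in E2 : (a, t) in R1 for all a in A and (c, t) in R2 for all c in C}."""
    return {t for t in E2
            if all((a, t) in R1 for a in A)
            and all((c, t) in R2 for c in C)}

def gar_support(A, C, R1, R2, E2):
    """Support of the generalized rule A -> C across cards R1 and R2."""
    return len(support_set(A, C, R1, R2, E2))

def gar_confidence(A, C, R1, R2, E2):
    """Support of A -> C divided by the R1-only support of A."""
    s_a = len(support_set(A, set(), R1, R2, E2))  # empty C: no R2 constraint
    return gar_support(A, C, R1, R2, E2) / s_a if s_a else 0.0

# Hypothetical cards: R1 relates terms (E1) to docs (E2),
# R2 relates authors (E3) to docs (E2).
E2 = {"d1", "d2", "d3"}
R1 = {("rolodex", "d1"), ("rolodex", "d2"), ("rolodex", "d3"), ("cube", "d1")}
R2 = {("ann", "d1"), ("ann", "d2"), ("bob", "d3")}

A, C = {"rolodex"}, {"ann"}
print(gar_support(A, C, R1, R2, E2), gar_confidence(A, C, R1, R2, E2))
```

When R1 and R2 are the same card and A and C are disjoint sets on the same axis, `gar_support` reduces to the ordinary itemset support of A∪C.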
Association Pathway Mining (APM) is the identification and assessment
(e.g., support, confidence, etc.) of chains of GARs in a RoloDex.
Restricting to the mining of cousin GARs reduces the number of strong rules or pathway links.
More generally,
A ⊆ E1 -R1- E2 -R2- E3 ⊇ C
Support-Set_R1R2(A→C) = SS_R1R2(A→C) = {t∈E2 | ∀a∈A (a,t)∈R1, ∀c∈C (c,t)∈R2}
If E2 has real labels, Label-Weighted-Support_R1R2(A→C) = LWS_R1R2(A→C) = Σ_{t∈SS_R1R2(A→C)} label(t)
Downward closure property of Support Sets: SS(A'→C') ⊇ SS(A→C) ∀A'⊆A, C'⊆C.
Therefore, if all labels are non-negative, then LWS(A→C) ≤ LWS(A'→C')
(a necessary condition for LWS(A→C) to exceed a threshold is that all LWS(A'→C')
exceed that threshold ∀A'⊆A, C'⊆C).
[Figure: cards on the shared, labeled axis E2 (labels such as l2,2 and l2,3), with A ⊆ E1 and C ⊆ E3 marked.]
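The downward closure of support sets, and the resulting monotonicity of the label-weighted support, can be checked on a toy example. The sketch assumes universal quantifiers and non-negative labels; the cards, labels, and function names are hypothetical.

```python
def support_set(A, C, R1, R2, E2):
    """{t in E2 : (a, t) in R1 for all a in A and (c, t) in R2 for all c in C}."""
    return {t for t in E2
            if all((a, t) in R1 for a in A)
            and all((c, t) in R2 for c in C)}

def lws(A, C, R1, R2, labels):
    """Label-weighted support: sum of label(t) over the support set."""
    return sum(labels[t] for t in support_set(A, C, R1, R2, labels.keys()))

# Hypothetical non-negative labels on the shared axis E2.
labels = {"d1": 2.0, "d2": 0.5, "d3": 1.0}
R1 = {("rolodex", "d1"), ("rolodex", "d2"), ("rolodex", "d3"), ("cube", "d1")}
R2 = {("ann", "d1"), ("ann", "d2"), ("bob", "d3")}

A, C = {"rolodex", "cube"}, {"ann"}   # the larger rule A -> C
A_sub, C_sub = {"rolodex"}, {"ann"}   # a sub-rule with A' subset of A

ss_big = support_set(A, C, R1, R2, labels.keys())
ss_sub = support_set(A_sub, C_sub, R1, R2, labels.keys())

# Shrinking the antecedent can only enlarge the support set, so with
# non-negative labels the weighted support cannot decrease.
print(ss_big <= ss_sub, lws(A, C, R1, R2, labels), lws(A_sub, C_sub, R1, R2, labels))
```

This is the property an Apriori-style search would rely on for pruning: a pair (A, C) need only be considered when every sub-pair (A', C') already meets the threshold.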
So an Apriori-like frequent set pair mine would go as follows: start with pairs of 1-sets (in E1 and E3).
The only candidate 2-antecedents with 1-consequents (equivalently, 2-consequents with 1-antecedents)
would be those formed by joining ...
The weighted support concept can be extended to the case where R1 and/or R2 have labels as well.
Vertical methods can be applied by converting E2 to vertical format (E2 instances are the rows
and pertinent features from other cards/axes are "rolled over" to E2 as derived feature attributes