Datamining Methods
Mining Association Rules and Sequential Patterns
P.Brezany
Institut für Scientific Computing - Universität Wien

KDD (Knowledge Discovery in Databases) Process
(Figure: Operational Databases are cleaned, collected, and summarized into a Data Warehouse; data preparation yields Training Data; data mining produces a Model / Patterns, followed by Verification & Evaluation.)

Mining Association Rules
• Association rule mining finds interesting association or correlation relationships among a large set of data items.
• This can help in many business decision-making processes: store layout, catalog design, and customer segmentation based on buying patterns. Another important field: medical applications.
• Market basket analysis is a typical example of association rule mining.
• How can we find association rules from large amounts of data? Which association rules are the most interesting? How can we help or guide the mining procedures?

Informal Introduction
• Given a set of database transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items (literals). The intuitive meaning of the rule: transactions in the database which contain the items in X tend to also contain the items in Y.
• Example: 98% of customers who purchase tires and auto accessories also buy some automotive services; here 98% is called the confidence of the rule. The support of the rule is the percentage of transactions that contain both X and Y.
• The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.

Basic Concepts
Let J = {i1, i2, ..., im} be a set of items. Typically, the items are identifiers of individual articles (products), e.g., bar codes.
Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. Let A be a set of items: a transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊆ J, B ⊆ J, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is the probability P(A ∪ B).

Basic Concepts (Cont.)
The rule A ⇒ B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B - the conditional probability P(B|A). Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.

Basic Concepts - Example

transaction  purchased items
1            bread, coffee, milk, cake
2            coffee, milk, cake
3            bread, butter, coffee, milk
4            milk, cake
5            bread, cake
6            bread

X = {coffee, milk}; R = {coffee, cake, milk}
support of X = 3 of 6 = 50%
support of R = 2 of 6 = 33%
Support of "milk, coffee" ⇒ "cake" equals the support of R = 33%.
Confidence of "milk, coffee" ⇒ "cake" = 2 of 3 = 67% [= support(R)/support(X)]

Basic Concepts (Cont.)
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset.
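The support and confidence figures in the example above can be checked with a short Python sketch (not part of the original slides; item names taken from the example table):

```python
# Transactions from the "Basic Concepts - Example" slide.
transactions = [
    {"bread", "coffee", "milk", "cake"},
    {"coffee", "milk", "cake"},
    {"bread", "butter", "coffee", "milk"},
    {"milk", "cake"},
    {"bread", "cake"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A ∪ B) / support(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"coffee", "milk"}, transactions))               # 3/6 = 0.5
print(support({"coffee", "cake", "milk"}, transactions))       # 2/6 ≈ 0.33
print(confidence({"coffee", "milk"}, {"cake"}, transactions))  # 2/3 ≈ 0.67
```

The same two helpers are enough to check any rule against a toy transaction set.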
The set of frequent k-itemsets is commonly denoted by Lk. Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.

Association Rule Classification
• Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it is a Boolean association rule. For example:
computer ⇒ financial_management_software [support = 2%, confidence = 60%]
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. For example:
age(X, "30..39") ∧ income(X, "42K..48K") ⇒ buys(X, high resolution TV)
Note that the quantitative attributes, age and income, have been discretized.

Association Rule Classification (Cont.)
• Based on the dimensions of data involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example:
buys(X, "computer") ⇒ buys(X, "financial management software")
The above rule refers to only one dimension, buys. If a rule references two or more dimensions, such as buys, time_of_transaction, and customer_category, then it is a multidimensional association rule. The second rule on the previous slide is a 3-dimensional association rule, since it involves three dimensions: age, income, and buys.

Association Rule Classification (Cont.)
• Based on the levels of abstraction involved in the rule set: Suppose that a set of mined association rules includes:
age(X, "30..39") ⇒ buys(X, "laptop computer")
age(X, "30..39") ⇒ buys(X, "computer")
In the above rules, the items bought are referenced at different levels of abstraction. (E.g., "computer" is a higher-level abstraction of "laptop computer".)
Such rules are called multilevel association rules. Single-level association rules refer to one abstraction level only.

Mining Single-Dimensional Boolean Association Rules from Transactional Databases
This is the simplest form of association rules (used in market basket analysis). We present Apriori, a basic algorithm for finding frequent itemsets. Its name reflects the fact that it uses prior knowledge of frequent itemset properties (explained later). Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets, L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. The Apriori property is used to reduce the search space.

The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent. If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup. How is the Apriori property used in the algorithm? To understand this, let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions. These steps are explained on the next slides.

The Apriori Algorithm – the Join Step
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted by Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second to last item in l1).
Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join Lk-1 ⋈ Lk-1 is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if
(l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]).
The condition (l1[k-1] < l2[k-1]) simply ensures that no duplicates are generated. The resulting itemset is l1[1] l1[2] ... l1[k-1] l2[k-1].

The Apriori Algorithm – the Join Step (2)
Illustration by an example:
p ∈ Lk-1 = (1 2 3)
q ∈ Lk-1 = (1 2 4)
Join result ∈ Ck = (1 2 3 4)
Each frequent k-itemset p is always extended by the last item of all frequent itemsets q which have the same first k-1 items as p.

The Apriori Algorithm – the Prune Step
Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. The above subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

The Apriori Algorithm - Example
Let's look at a concrete example of Apriori, based on the AllElectronics transaction database D, shown below. There are nine transactions in this database, i.e., |D| = 9.
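The join and prune steps can be sketched in Python (a minimal illustration, not the slides' own pseudocode; the hash-tree subset test is replaced by a plain set lookup):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets
    L_prev, using the join and prune steps. Itemsets are sorted tuples."""
    L_prev = sorted(L_prev)
    candidates = []
    # Join step: merge itemsets whose first k-2 items agree and whose
    # last items satisfy l1[k-1] < l2[k-1] (no duplicates).
    for i, l1 in enumerate(L_prev):
        for l2 in L_prev[i + 1:]:
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                candidates.append(l1 + (l2[k - 2],))
    # Prune step: drop candidates having an infrequent (k-1)-subset.
    frequent = set(L_prev)
    return [c for c in candidates
            if all(s in frequent for s in combinations(c, k - 1))]

# L2 from the AllElectronics example that follows:
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

Of the six joined candidates, four are pruned because one of their 2-subsets (e.g., {I3,I5}) is not in L2, leaving exactly the C3 derived on the slides.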
We use the next figure to illustrate the finding of frequent itemsets in D.

TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Generation of Ck and Lk (min. supp. count = 2)
Scan D for the count of each candidate, giving C1:

Itemset  Sup. count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

Comparing each candidate support count with the minimum support count gives L1 (identical to C1 here). Generating the C2 candidates from L1 and scanning D for their counts:

Itemset   Sup. count
{I1,I2}   4
{I1,I3}   4
{I1,I4}   1
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2
{I3,I4}   0
{I3,I5}   1
{I4,I5}   0

Comparing with the minimum support count gives L2:

Itemset   Sup. count
{I1,I2}   4
{I1,I3}   4
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2

Generation of Ck and Lk (min. supp. count = 2)
Generating the C3 candidates from L2 and scanning D for their counts gives C3, and comparing with the minimum support count gives L3:

Itemset      Sup. count
{I1,I2,I3}   2
{I1,I2,I5}   2

Algorithm Application Description
1. In the 1st iteration, each item is a member of C1. The algorithm simply scans all the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum transaction support count is 2 (min_sup = 2/9 = 22%). L1 can then be determined.
3. C2 = L1 join L1.
4. The transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in the last figure.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
Algorithm Application Description (2)
6. The generation of C3 = L2 join L2 is detailed in the next figure. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. C4 = L3 join L3; after the pruning, C4 = ∅.

Example: Generation of C3 from L2
1. Join: C3 = L2 ⋈ L2 = {{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5}} ⋈ {{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5}} = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
2. Prune using the Apriori property: All nonempty subsets of a frequent itemset must also be frequent. The 2-item subsets of {I1,I2,I3} are {I1,I2}, {I1,I3}, {I2,I3}, and they are all members of L2. Therefore, keep {I1,I2,I3} in C3. The 2-item subsets of {I1,I2,I5} are {I1,I2}, {I1,I5}, {I2,I5}, and they are all members of L2. Therefore, keep {I1,I2,I5} in C3. Using the same analysis, remove the other 3-itemsets from C3.
3. Therefore, C3 = {{I1,I2,I3}, {I1,I2,I5}} after pruning.

Generating Association Rules from Frequent Items
We generate strong association rules - they satisfy both minimum support and minimum confidence.

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.

Generating Association Rules from Frequent Items (Cont.)
Based on the equation on the previous slide, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule "s ⇒ (l - s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Generating Association Rules - Example
Suppose that the transactional data for AllElectronics contain the frequent itemset l = {I1,I2,I5}. The resulting rules are:

I1 ∧ I2 ⇒ I5,  confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2,  confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1,  confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5,  confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5,  confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2,  confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, since these are the only ones generated that are strong.

Multilevel (Generalized) Association Rules
For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space. Strong associations discovered at high concept levels may represent common sense knowledge. However, what may represent common sense to one user may seem novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction and traverse easily among different abstraction spaces.

Multilevel (Generalized) Association Rules - Example
Suppose we are given the following task-relevant set of transactional data for sales at the computer department of an AllElectronics branch, showing the items purchased for each transaction TID.

Table Transactions
TID   Items purchased
T1    IBM desktop computer, Sony b/w printer
T2    Microsoft educational software, Microsoft financial software
T3    Logitech mouse computer accessory, Ergoway wrist pad accessory
T4    IBM desktop computer, Microsoft financial software
T5    IBM desktop computer
...

A Concept Hierarchy for our Example
(Figure: a concept hierarchy with "all" at level 0; level 1: computer, software, printer, computer accessory; level 2: desktop, laptop, educational, financial, color, b/w, wrist pad, mouse; level 3: vendor-specific items such as IBM, Microsoft, HP, Sony, Ergoway, Logitech.)

Example (Cont.)
The items in Table Transactions are at the lowest level of the concept hierarchy. It is difficult to find interesting purchase patterns at such raw or primitive-level data. If, e.g., "IBM desktop computer" or "Sony b/w printer" each occurs in a very small fraction of the transactions, then it may be difficult to find strong associations involving such items. In other words, it is unlikely that the itemset "{IBM desktop computer, Sony b/w printer}" will satisfy minimum support. Itemsets containing generalized items, such as "{IBM desktop computer, b/w printer}" and "{computer, printer}", are more likely to have minimum support. Rules generated from association rule mining with concept hierarchies are called multiple-level, multilevel, or generalized association rules.

Parallel Formulation of Association Rules
• Need:
– Huge transaction datasets (10s of TB)
– Large number of candidates
• Data distribution:
– Partition the transaction database, or
– Partition the candidates, or
– Both

Parallel Association Rules: Count Distribution (CD)
• Each processor has the complete candidate hash tree.
• Each processor updates its hash tree with local data.
• Each processor participates in a global reduction to get the global counts of the candidates in the hash tree.
• Multiple database scans per iteration are required if the hash tree is too big for memory.

CD: Illustration
(Figure: processors P0, P1, P2 each hold N/p transactions and a full copy of the candidate hash tree, e.g., candidates {1,2}, {1,3}, {2,3}, {3,4}, {5,8} with local counts; a global reduction combines the local counts into global counts.)

Parallel Association Rules: Data Distribution (DD)
• The candidate set is partitioned among the processors.
• Once local data has been partitioned, it is broadcast to all other processors.
• High communication cost due to data movement.
• Redundant work due to multiple traversals of the hash trees.

DD: Illustration
(Figure: processors P0, P1, P2 each hold N/p local transactions plus remote data received via data broadcast; each processor counts only its own partition of the candidates, e.g., {1,2}: 9, {1,3}: 10, {2,3}: 12, {3,4}: 10, {5,8}: 17; the candidates are exchanged by an all-to-all broadcast.)

Predictive Model Markup Language – PMML and Visualization

Predictive Model Markup Language PMML
• Markup language (XML) to describe data mining models
• PMML describes:
– the inputs to data mining models
– the transformations used to prepare data for data mining
– the parameters which define the models themselves

PMML 2.1 – Association Rules (1)
1. Model attributes (1)
<xs:element name="AssociationModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      <xs:element ref="MiningSchema" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Item" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Itemset" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="AssociationRule" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
…

PMML 2.1 – Association Rules (2)
1. Model attributes (2)
<xs:attribute name="modelName" type="xs:string" />
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
<xs:attribute name="algorithmName" type="xs:string" />
<xs:attribute name="numberOfTransactions" type="INT-NUMBER" use="required" />
<xs:attribute name="maxNumberOfItemsPerTA" type="INT-NUMBER" />
<xs:attribute name="avgNumberOfItemsPerTA" type="REAL-NUMBER" />
<xs:attribute name="minimumSupport" type="PROB-NUMBER" use="required" />
<xs:attribute name="minimumConfidence" type="PROB-NUMBER" use="required" />
<xs:attribute name="lengthLimit" type="INT-NUMBER" />
<xs:attribute name="numberOfItems" type="INT-NUMBER" use="required" />
<xs:attribute name="numberOfItemsets" type="INT-NUMBER" use="required" />
<xs:attribute name="numberOfRules" type="INT-NUMBER" use="required" />
</xs:complexType>
</xs:element>

PMML 2.1 – Association Rules (3)
2. Items
<xs:element name="Item">
  <xs:complexType>
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="value" type="xs:string" use="required" />
    <xs:attribute name="mappedValue" type="xs:string" />
    <xs:attribute name="weight" type="REAL-NUMBER" />
  </xs:complexType>
</xs:element>

PMML 2.1 – Association Rules (4)
3.
ItemSets
<xs:element name="Itemset">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="ItemRef" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="support" type="PROB-NUMBER" />
    <xs:attribute name="numberOfItems" type="INT-NUMBER" />
  </xs:complexType>
</xs:element>

PMML 2.1 – Association Rules (5)
4. AssociationRules
<xs:element name="AssociationRule">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="support" type="PROB-NUMBER" use="required" />
    <xs:attribute name="confidence" type="PROB-NUMBER" use="required" />
    <xs:attribute name="antecedent" type="xs:string" use="required" />
    <xs:attribute name="consequent" type="xs:string" use="required" />
  </xs:complexType>
</xs:element>

PMML example model for AssociationRules (1)
<?xml version="1.0" ?>
<PMML version="2.1" >
<DataDictionary numberOfFields="2" >
  <DataField name="transaction" optype="categorical" />
  <DataField name="item" optype="categorical" />
</DataDictionary>
<AssociationModel functionName="associationRules"
    numberOfTransactions="4" numberOfItems="4"
    minimumSupport="0.6" minimumConfidence="0.3"
    numberOfItemsets="7" numberOfRules="3">
<MiningSchema>
  <MiningField name="transaction"/>
  <MiningField name="item"/>
</MiningSchema>

PMML example model for AssociationRules (2)
<!-- four items - input data -->
<Item id="1" value="PC" />
<Item id="2" value="Monitor" />
<Item id="3" value="Printer" />
<Item id="4" value="Notebook" />
<!-- three frequent 1-itemsets -->
<Itemset id="1" support="1.0" numberOfItems="1">
  <ItemRef itemRef="1" />
</Itemset>
<Itemset id="2" support="1.0" numberOfItems="1">
  <ItemRef itemRef="2" />
</Itemset>
<Itemset id="3" support="1.0" numberOfItems="1">
  <ItemRef itemRef="3" />
</Itemset>

PMML example model for AssociationRules (3)
<!-- three frequent 2-itemsets -->
<Itemset id="4" support="1.0" numberOfItems="2">
  <ItemRef itemRef="1" />
  <ItemRef itemRef="2" />
</Itemset>
<Itemset id="5" support="1.0" numberOfItems="2">
  <ItemRef itemRef="1" />
  <ItemRef itemRef="3" />
</Itemset>
<Itemset id="6" support="1.0" numberOfItems="2">
  <ItemRef itemRef="2" />
  <ItemRef itemRef="3" />
</Itemset>

PMML example model for AssociationRules (4)
<!-- one frequent 3-itemset -->
<Itemset id="7" support="0.9" numberOfItems="3">
  <ItemRef itemRef="1" />
  <ItemRef itemRef="2" />
  <ItemRef itemRef="3" />
</Itemset>
<!-- three rules satisfy the requirements - the output -->
<AssociationRule support="0.9" confidence="0.85" antecedent="4" consequent="3" />
<AssociationRule support="0.9" confidence="0.75" antecedent="1" consequent="6" />
<AssociationRule support="0.9" confidence="0.70" antecedent="6" consequent="1" />
</AssociationModel>
</PMML>

Visualization of Association Rules (1)
1. Table Format

Antecedent        Consequent        Support  Confidence
PC, Monitor       Printer           90%      85%
PC                Printer, Monitor  90%      75%
Printer, Monitor  PC                80%      70%

Visualization of Association Rules (2)
2. Directed Graph
(Figure: directed graphs over the items PC, Monitor, and Printer, one per rule, with edges leading from the antecedent items to the consequent items.)

Visualization of Association Rules (3)
3.
3-D Visualisation
(Figure: a 3-D visualisation of association rules.)

Mining Sequential Patterns (Mining Sequential Associations)

Mining Sequential Patterns
• Discovering sequential patterns is a relatively new data mining problem.
• The input data is a set of sequences, called data-sequences.
• Each data-sequence is a list of transactions, where each transaction is a set of items. Typically, there is a transaction time associated with each transaction.
• A sequential pattern also consists of a list of sets of items.
• The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.

Application Examples
• Book club
Each data-sequence may correspond to all book selections of a customer, and each transaction corresponds to the books selected by the customer in one order. A sequential pattern may be: "5% of customers bought 'Foundation', then 'Foundation and Empire', and then 'Second Foundation'". A data-sequence of a customer who bought some other books in between these books still contains this sequential pattern.
• Medical domain
A data-sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during a visit to the doctor. The patterns discovered could be used in disease research to help identify symptoms or diseases that precede certain diseases.

Discovering Sequential Associations
Given: A set of objects with associated event occurrences.
(Figure: timelines for Object 1 and Object 2, with numbered events occurring at various points along each timeline.)

Problem Statement
We are given a database D of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction time. We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not.
A sequence is an ordered list of itemsets. We denote an itemset i by (i1 i2 ... im), where ij is an item. We denote a sequence s by <s1 s2 ... sn>, where sj is an itemset. A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

Problem Statement (2)
For example, <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6), and (8) ⊆ (8). However, the sequence <(3) (5)> is not contained in <(3 5)> (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together.
In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
Customer sequence - the itemset list of a customer's transactions ordered by increasing transaction time: <itemset(T1) itemset(T2) ... itemset(Tn)>

Problem Statement (3)
A customer supports a sequence s if s is contained in the customer sequence for this customer. The support for a sequence is defined as the fraction of total customers who support this sequence.
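The containment test follows directly from the definition and can be sketched as a greedy left-to-right match (illustrative code, not from the slides):

```python
def contains(pattern, data_sequence):
    """True if `pattern` (a list of itemsets) is contained in
    `data_sequence`: each pattern itemset must be a subset of a distinct
    itemset of the data-sequence, in the same order."""
    pos = 0
    for itemset in pattern:
        itemset = set(itemset)
        # Advance to the next data-sequence element that covers `itemset`.
        while pos < len(data_sequence) and not itemset <= set(data_sequence[pos]):
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1  # later pattern itemsets must match strictly later elements
    return True

# The examples from the slides:
print(contains([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(contains([{3}, {5}], [{3, 5}]))  # False
print(contains([{3, 5}], [{3}, {5}]))  # False
```

Matching each pattern itemset to the earliest possible position is safe here: if any valid assignment of positions exists, the greedy one does too.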
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern. We call a sequence satisfying the minimum support constraint a large sequence. See the next example.

Example
Database sorted by customer Id and transaction time:

Customer Id  Transaction Time  Items Bought
1            June 25 '00       30
1            June 30 '00       90
2            June 10 '00       10, 20
2            June 15 '00       30
2            June 20 '00       40, 60, 70
3            June 25 '00       30, 50, 70
4            June 25 '00       30
4            June 30 '00       40, 70
4            July 25 '00       90
5            June 12 '00       90

Customer-sequence version of the database:

Customer Id  Customer Sequence
1            <(30) (90)>
2            <(10 20) (30) (40 60 70)>
3            <(30 50 70)>
4            <(30) (40 70) (90)>
5            <(90)>

Example (2)
With minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, <(30) (90)> and <(30) (40 70)>, are maximal among those satisfying the support constraint, and are the desired sequential patterns. <(30) (90)> is supported by customers 1 and 4. Customer 4 buys items (40 70) in between items 30 and 90, but supports the pattern <(30) (90)>, since we are looking for patterns that are not necessarily contiguous. <(30) (40 70)> is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and 70, but supports this pattern, since (40 70) is a subset of (40 60 70).
E.g., the sequence <(10 20) (30)> does not have minimum support; it is supported only by customer 2. The sequences <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)>, and <(40 70)> have minimum support, but they are not maximal - therefore, they are not in the answer.
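The example above can be reproduced with a brute-force sketch (illustrative only; a real miner such as AprioriAll or GSP avoids enumerating all subsequences):

```python
from itertools import combinations

def contains(pattern, seq):
    """Greedy test: is `pattern` (list of sets) contained in `seq`?"""
    pos = 0
    for s in pattern:
        while pos < len(seq) and not s <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def subsequences(seq):
    """All nonempty sequences formed by picking some itemsets of `seq`
    (in order) and a nonempty subset of each; tuples of sorted tuples."""
    result = set()
    for r in range(1, len(seq) + 1):
        for positions in combinations(range(len(seq)), r):
            choices = [()]
            for p in positions:
                items = sorted(seq[p])
                subs = [c for n in range(1, len(items) + 1)
                        for c in combinations(items, n)]
                choices = [pref + (s,) for pref in choices for s in subs]
            result.update(choices)
    return result

def sequential_patterns(db, min_support):
    candidates = set().union(*(subsequences(seq) for seq in db))
    large = [c for c in candidates
             if sum(contains([set(s) for s in c], seq) for seq in db)
                >= min_support * len(db)]
    # Keep only the maximal large sequences.
    return [c for c in large
            if not any(c != o and contains([set(s) for s in c],
                                           [set(s) for s in o])
                       for o in large)]

# Customer-sequence version of the example database.
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
print(sorted(sequential_patterns(db, 0.25)))
# [((30,), (40, 70)), ((30,), (90,))]  — i.e., <(30) (40 70)> and <(30) (90)>
```

Enumerating candidates per customer sequence is exponential in the sequence length, which is why level-wise candidate generation is used in practice; for this five-customer toy database it is instant.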