Data Mining Concept Animation Library
By Nisarg Shah
Project Advisor: Dr. Meiliu Lu
Department of Computer Science
California State University, Sacramento
Spring 2008

Agenda
• Motivation
• Scope of the Project
• Background Knowledge
• Approach
• Implementation Details
• Demo
• Lessons Learned / Reinforced
• Future work
• References
• Q&A / Feedback

3/10/2008 Data Mining Concept Animation Library

Motivation
• CSc177 – Data Mining courseware
  – Another student's work in CSc212 courseware [1]
  – Something that helps students learn the material better
• Idea from another Master's project
  – Operating System Concept Animation Library [2]
• Own experience
  – Having gone through the pain of understanding Data Mining algorithms myself
  – Not much graphical and interactive material available elsewhere

Scope of the Project
• Data Mining Concept Animation Library
  – A collection of Data Mining algorithms with a graphical and interactive user interface
  – Basic idea: students can learn, understand and compare different algorithms
  – There are plenty of algorithms and it is impossible to cover all of them
  – Just a start …
    • Apriori algorithm
    • Frequent Pattern (FP) Growth algorithm

Background Knowledge
Apriori Algorithm
• Apriori pruning principle:
  – If there is any itemset which is infrequent, its superset should not be generated/tested!
• Method:
  – Initially, scan DB once to get frequent 1-itemsets
  – Generate length (k+1) candidate itemsets from length k frequent itemsets
    • Step 1: self-joining Lk
    • Step 2: pruning
  – Test the candidates against DB
  – Terminate when no frequent or candidate set can be generated

Background Knowledge (2)
Apriori Example

Database (Supmin = 2)
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

1st scan gives C1:
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {D}       1
  {E}       3

L1:
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan gives C2 with support counts:
  Itemset   sup
  {A, B}    1
  {A, C}    2
  {A, E}    1
  {B, C}    2
  {B, E}    3
  {C, E}    2

L2:
  Itemset   sup
  {A, C}    2
  {B, C}    2
  {B, E}    3
  {C, E}    2

C3: {B, C, E}

3rd scan gives L3:
  Itemset      sup
  {B, C, E}    2

Background Knowledge (3)
Frequent Pattern Growth Algorithm
• Mining Frequent Patterns Without Candidate Generation
• Two-step method
  – Construct FP-tree from a Transaction Database
    • Scan DB once, find frequent 1-itemsets (single item patterns)
    • Sort frequent items in frequency descending order, f-list
    • Scan DB again, construct FP-tree
  – Find Patterns Having P From P-conditional Database
    • Start at the frequent item header table in the FP-tree
    • Traverse the FP-tree by following the link of each frequent item p
    • Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Background Knowledge (4)
FP Growth Example – Step 1
Construct FP-tree from a Transaction Database

min_support = 3

  TID   Items bought                (ordered) frequent items
  100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
  300   {b, f, h, j, o, w}          {f, b}
  400   {b, c, k, s, p}             {c, b, p}
  500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header Table
  Item   frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

F-list = f-c-a-b-m-p

FP-tree:
  {}
  ├─ f:4
  │  ├─ c:3
  │  │  └─ a:3
  │  │     ├─ m:2
  │  │     │  └─ p:2
  │  │     └─ b:1
  │  │        └─ m:1
  │  └─ b:1
  └─ c:1
     └─ b:1
        └─ p:1

Background Knowledge
(5) FP Growth Example – Step 2
Find Patterns Having P From P-conditional Database

Header table and FP-tree: same as in Step 1.

Conditional pattern bases
  item   cond. pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

Approach
• Gathering requirements
  – In-depth understanding of the two algorithms
  – Input from students/professor (couldn't really do this in detail)
  – Choosing the appropriate tools for implementation
• Feasibility test
  – Starting point: algorithm pseudo code
  – Console-based application
  – Algorithms are dynamic in nature
    • With configurable files for input parameters and transaction details

Approach (2)
• GUI Application
  – Choosing the right technology
  – Choosing the right components
  – Setting limitations (max items, max transactions)
  – Simple and consistent background
  – Multiple algorithms under different tabs
    • Users can compare the outputs against each other

Implementation Details
• Why Applets?
  – An applet is a program written in the Java programming language that can be included in an HTML page
  – Works well with Java technology-enabled web browsers
  – It can run at a speed comparable to compiled languages such as C++, and many times faster than JavaScript
  – It can move the work from the server to the client, making a web solution more scalable with the number of users/clients

Implementation Details (2)
• Problem: counting support of candidates
  – How many times does a particular candidate itemset (of any length) appear in the transaction table?
• Solution
  – All possible candidate itemsets are generated and their counts calculated only once, at the beginning
  – Candidate itemsets are stored in a hash table as (key, val) pairs
    • Key = candidate itemset; val = count

Implementation Details (3)
• Use of a third-party tool
• Combination generator [4]
  – Generates all possible combinations of a given size for a given itemset
  – Example: given itemset {A,B,C,D}
    • All possible 3-itemsets: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}

Implementation Details (4)
• Choosing between components
  – Purpose: display and edit regular 2-D tables of cells
  – Options: JTable vs. JEditorPane
    • JTable – Rendering problem: refreshing the screen on detecting table selections
    • JEditorPane – With the use of JCheckBoxes

Implementation Details (5)
• Problem: How to display the FP-tree?
• Solution: JGraph
  – Takes the description of a graph as input, and produces a graph display on the standard output

Lessons Learned/Reinforced
• Don't procrastinate (especially if you're working full-time)
• Clear understanding of the Apriori and FP Growth algorithms
• Good programming experience with Java and Applets
• Demonstrations/graphical tools are useful for explaining concepts
• A simple project idea can be meaningful

Future Work
• Get feedback and recommendations from potential users (students and instructor)
• An open source library: Data Mining Concept Animation Library
• A comparison between multiple algorithms on the same set of data
• Potential idea for a course project or Bachelor's/Master's project

References
1. XML Data Representation and Transformations for Bioinformatics. http://athena.ecs.csus.edu/~woodsk/courseware/
2. Operating System Concept Animation Library. http://gaia.ecs.csus.edu/%7Ezhangd/oscal/oscal.htm
3. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.
4. Combination generator tool. http://www.merriampark.com/comb.htm

Q&A / Feedback
• Questions?
• Feedback
  – Useful or not?
  – Any recommendations / suggestions?
  – Which algorithms/concepts from the current material would you like to see implemented in such a tool?

Thank you!
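
Supplementary sketch (Apriori)
As a supplement to the Apriori background and example slides, here is a minimal Java sketch of the level-wise method (scan, self-join, prune, re-scan), run on the four-transaction example database with minimum support 2. It is an illustration only, not the project's applet code. The class name AprioriSketch and the methods exampleDb, mine, nextCandidates and allSubsetsFrequent are hypothetical names chosen for this sketch. Unlike the solution described under Implementation Details (2), it counts support level by level rather than precomputing every candidate itemset in a hash table. It assumes Java 9 or later (for List.of and Set.of).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AprioriSketch {

    // Transaction database from the "Apriori Example" slide (Tid 10, 20, 30, 40).
    static List<Set<String>> exampleDb() {
        return List.of(
            Set.of("A", "C", "D"),
            Set.of("B", "C", "E"),
            Set.of("A", "B", "C", "E"),
            Set.of("B", "E"));
    }

    // Returns every frequent itemset mapped to its support count.
    static Map<Set<String>, Integer> mine(List<Set<String>> db, int minSup) {
        Map<Set<String>, Integer> frequent = new HashMap<>();

        // C1: every distinct item is a candidate 1-itemset.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> transaction : db) {
            for (String item : transaction) {
                candidates.add(Set.of(item));
            }
        }

        while (!candidates.isEmpty()) {
            // One database scan: keep the candidates that reach minSup (this is Lk).
            Set<Set<String>> level = new HashSet<>();
            for (Set<String> candidate : candidates) {
                int count = 0;
                for (Set<String> transaction : db) {
                    if (transaction.containsAll(candidate)) {
                        count++;
                    }
                }
                if (count >= minSup) {
                    frequent.put(candidate, count);
                    level.add(candidate);
                }
            }
            candidates = nextCandidates(level);
        }
        return frequent;
    }

    // Step 1: self-join Lk. Step 2: prune candidates with an infrequent k-subset.
    static Set<Set<String>> nextCandidates(Set<Set<String>> level) {
        Set<Set<String>> next = new HashSet<>();
        for (Set<String> a : level) {
            for (Set<String> b : level) {
                Set<String> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() == a.size() + 1 && allSubsetsFrequent(union, level)) {
                    next.add(union);
                }
            }
        }
        return next;
    }

    // Apriori pruning principle: every k-subset of a (k+1)-candidate must be in Lk.
    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> level) {
        for (String item : candidate) {
            Set<String> subset = new HashSet<>(candidate);
            subset.remove(item);
            if (!level.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        mine(exampleDb(), 2)
            .forEach((itemset, sup) -> System.out.println(itemset + " sup=" + sup));
    }
}
```

On the example database the sketch should report nine frequent itemsets, including {B, C, E} with support 2, which matches L1, L2 and L3 on the Apriori Example slide. The print order is not fixed because the results are held in a HashMap.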