Introduction to Biometrics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Information Relevant to the Project
September 21, 2005

Outline of the Unit

- Project Information
- Data Mining Techniques (may be useful for the project)

Project Information

In this assignment, you will experiment with the following two tasks:
- Face recognition
- Pose recognition

For this experiment you will use the neural network package given in the code directory at
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html

For training and testing, you will use the face images found in the trainset directory at the same link.

(OPTIONAL) You will also apply k-nearest neighbor to the same dataset and compare its performance with the neural network.

Project Information (Continued)

You do not need to do a significant amount of coding for the main part of the project. The OPTIONAL k-nearest-neighbor part does require some coding: you will use k-nearest neighbor in place of back propagation in the code.

Training your networks for part 1 will take time. It is recommended that you read the materials at the assignment link thoroughly and start early.

Project Information (Continued)

Face images

The image data can be found in the faces directory at the assignment link. This directory contains 20 subdirectories, one for each person who volunteered for the photo shoot, named by userid. Each of these directories contains several versions of the face images.
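For the optional part, a minimal k-nearest-neighbor classifier might look like the sketch below. This is not the assignment's code: it assumes each image has already been reduced to a numeric feature vector (e.g., downsampled pixel intensities), and it classifies a query by majority vote among the k closest training examples under Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training examples; `train` is a list of (features, label) pairs."""
    def dist(a, b):
        # Euclidean distance between two feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy feature vectors standing in for processed face images
train = [((0.0, 0.0), "left"), ((0.0, 1.0), "left"),
         ((5.0, 5.0), "up"),   ((5.0, 6.0), "up")]
```

Varying k (the assignment asks you to determine the best value) trades off sensitivity to noisy neighbors against over-smoothing the decision boundary.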
Project Information (Continued)

For the images, the following naming convention is used:
<userid>_<pose>_<expression>_<eyes>_<scale>.pgm
For further details see section 2 of the assignment document at the assignment link.

Project Information (Continued)

How to view the face images

You will need View.java and View.class to view the images. These two files will be placed on my web site by tomorrow.

To view the images, you can use the Java program View.java. View.java handles a variety of image formats, including the PGM format in which the face images are stored. While we won't go into detail about View in this document, here are the basics you need to know to use it. To start View, just specify on the command line:

java View ImageInput

Here "ImageInput" is the image you want to view.

Project Information (Continued)

The neural network and image access code

You are given C code for a three-layer, fully connected, feed-forward neural network that uses the back propagation algorithm to tune its weights. You are also given the top-level program for training and recognition, as a skeleton for you to modify.
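The provided package is in C; purely as an illustration of the structure it implements, here is a Python sketch of a three-layer, fully connected, feed-forward network with sigmoid units. The layer sizes and weight-initialization range here are arbitrary choices, not the package's.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ThreeLayerNet:
    """Minimal fully connected feed-forward net: inputs -> hidden -> outputs."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # small random initial weights, plus one bias weight per unit
        self.w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, inputs):
        x = list(inputs) + [1.0]                      # append bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)))
                  for ws in self.w_hidden]
        h = hidden + [1.0]
        return [sigmoid(sum(w * v for w, v in zip(ws, h)))
                for ws in self.w_out]
```

Each output is a sigmoid of a weighted sum, so every output value lies strictly between 0 and 1.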
The code is located in the code directory at the assignment link. For further details see section 4 of the assignment document.

Project Information (Continued)

Assignment

Copy straight_train.list, straight_test1.list, straight_test2.list, all_train.list, all_test1.list, and all_test2.list into your home directory to obtain the training and test set data for this assignment, from the trainset link at the assignment page.

Train with straight_train.list and test with straight_test1.list and straight_test2.list, using the default learning parameter settings (learning rate 0.3, momentum 0.3), for 75 epochs.

Project Information (Continued)

For further assistance see section 5-3 of the assignment document.

Report your train/test performance and error as a function of epochs. If you had stopped training when the performance on test1 leveled off, what would the performance have been on test2? And vice versa?

Implement a face recognizer; i.e., implement a neural net which, when given an image as input, indicates who is in the image. To do this, you will need to implement a different output encoding. Describe your output encoding. (Hint: leave the learning rate and momentum at 0.3, and use 20 hidden units.)

Project Information (Continued)

Implement a pose recognizer; i.e., implement a neural net which, when given an image as input, indicates whether the person in the image is looking straight ahead, up, to the left, or to the right. You will also need to implement a different output encoding for this task. Describe your output encoding. (Hint: leave the learning rate and momentum at 0.3, and use 6 hidden units.)
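For orientation, the learning-rate and momentum parameters enter the back propagation weight update as delta_w(t) = eta * delta * x + alpha * delta_w(t-1). The sketch below applies this rule to a single sigmoid unit on a hypothetical toy task; the assignment's C code applies the same rule at every unit of the network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_unit(samples, n_epochs=75, eta=0.3, alpha=0.3):
    """Train one sigmoid unit with the momentum form of back propagation:
    delta_w(t) = eta * error_term * input + alpha * delta_w(t-1)."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                 # weights plus a trailing bias weight
    prev = [0.0] * (n + 1)              # previous weight changes (momentum)
    for _ in range(n_epochs):
        for x, target in samples:
            xb = list(x) + [1.0]        # append bias input
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
            err = o * (1.0 - o) * (target - o)   # sigmoid error term
            for i in range(n + 1):
                dw = eta * err * xb[i] + alpha * prev[i]
                w[i] += dw
                prev[i] = dw
    return w

# e.g., learn a simple threshold on one input over 75 epochs
w = train_unit([([0.0], 0.0), ([1.0], 1.0)], n_epochs=75)
```

The momentum term alpha * delta_w(t-1) carries part of the previous step forward, which smooths the weight trajectory and can speed convergence along consistent gradient directions.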
Train with all_train.list and test with all_test1.list and all_test2.list, using the default learning parameter settings (learning rate 0.3, momentum 0.3), for 100 epochs. For further assistance see section 5-8 of the assignment document.

Project Information (Concluded)

Report your train/test performance and error as a function of epochs. If you had stopped training when the performance on test1 leveled off, what would the performance have been on test2? And vice versa?

(OPTIONAL) Use k-nearest neighbor to do all the tasks above. Here you will determine the best k value.

Market Basket Analysis

- What is Market Basket Analysis?
- Example
- Basic Process
- Association Rules
- Disassociation Rules
- Sequential Time-Series Analysis
- Challenges and Directions
- Strengths/Weaknesses

What is Market Basket Analysis?

Market basket analysis is a collection of techniques that discover rules such as which items are purchased together. It has its roots in point-of-sale transactions, but has gone beyond that application:
- E.g., who travels together, who is seen with whom, etc.
Market basket analysis is used as a starting point when transaction data is available and we are not sure what patterns we are looking for:
- Find items that are purchased together

Essentially, market basket analysis produces association rules.

Example

Person   | Countries Visited
John     | England, France
James    | Germany, England, Switzerland
William  | England, Austria
Mary     | England, Austria, France
Jane     | Switzerland, France

Co-Occurrence Table

            | England | Switzerland | Germany | France | Austria
England     |    4    |      1      |    1    |   2    |   2
Switzerland |    1    |      2      |    1    |   1    |   0
Germany     |    1    |      1      |    1    |   0    |   0
France      |    2    |      1      |    0    |   3    |   1
Austria     |    2    |      0      |    0    |   1    |   2

Example (Concluded)

- England and France are more likely to be traveled together than any other two countries
- Austria is never traveled together with Germany or Switzerland
- Germany is never traveled together with Austria or France
- Rule: if a person travels to France, then he/she also travels to England
  - Support for this rule is 2 out of 5, i.e., 40%, since 2 of the five trips support the rule
  - Confidence for this rule is 66%, since two of the three trips that contain France also contain England
  - That is, the rule "if France then England" has support 40% and confidence 66%
- Challenge: how to generate such rules automatically

Basic Process

- Choosing the right set of items
  - Need to gather the right set of transaction data at the right level of detail, ensuring data quality
- Generating rules from the data
  - Generate a co-occurrence matrix for single items
  - Generate a co-occurrence matrix for pairs of items and use it to find rules with 2 items
  - Generate a co-occurrence matrix for triples of items and use it to find rules with 3 items; etc.
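The support and confidence figures in the travel example above can be computed directly from the transaction list; a minimal sketch:

```python
# trips data from the example above
trips = {
    "John":    {"England", "France"},
    "James":   {"Germany", "England", "Switzerland"},
    "William": {"England", "Austria"},
    "Mary":    {"England", "Austria", "France"},
    "Jane":    {"Switzerland", "France"},
}

def support(antecedent, consequent):
    """Fraction of all transactions containing both item sets."""
    both = sum(1 for items in trips.values()
               if antecedent <= items and consequent <= items)
    return both / len(trips)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    has_a = [items for items in trips.values() if antecedent <= items]
    both = sum(1 for items in has_a if consequent <= items)
    return both / len(has_a)

# "if France then England": support 2/5 = 40%, confidence 2/3 (about 66%)
s = support({"France"}, {"England"})
c = confidence({"France"}, {"England"})
```

Enumerating candidate rules this way over single items, pairs, triples, and so on is exactly the co-occurrence process outlined above; Apriori's contribution is pruning that enumeration.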
Algorithm: Apriori
- Overcomes the practical limits imposed by thousands of items
- Avoids combinatorial explosion

Association Rules

- Rules that find associations in data
- An example of an association rule is {x1, x2, x3} -> x4, meaning that if x1, x2, and x3 are purchased, then x4 is also purchased
- Association rules have confidence values
  - Strong rules are rules with a confidence value above a threshold
- The challenge is to improve the Apriori algorithm
  - E.g., partition-based approaches, sampling
- Frequent-pattern growth method
  - First a database projection is made of the frequent items; then the main-memory data is mined to build a data structure called a frequent-pattern tree
- Multidimensional association rule mining
  - Mining multidimensional transactional databases

Disassociation Rules

- Similar to association rules, but with a negative connector
  - E.g., if A and not B, then C
  - Whenever A is present and B is not present, C is present
- Follow a similar process, but introduce a new set of items that are the inverses of the original items
- Because of the extra items, there are performance problems
- Much of the work has focused on association rules rather than disassociation rules

Sequential Time-Series Analysis

- While market basket analysis focuses on associations that happen at one time, time-series analysis focuses on associations over a period
  - E.g., after John goes to the bank, he goes to church
- Transactions must carry additional information, such as a time stamp or sequencing information, and identifying information that ties different transactions to the same customer
- Market basket analysis techniques have been extended to include time-series analysis

Challenges and Directions

- Performance improvements
- Applying the techniques to web mining, including web content mining, web structure mining, and web usage mining
  - HITS and LOGSOM algorithms
  - Mining path-traversal patterns
- Finding associations in text
  - Associations between words in a document or across multiple documents
  - Text refining and transformations

Strengths/Weaknesses

- Strengths
  - Results are understandable
  - Supports data mining without knowing exactly what you want to find
  - Works on variable-length data
  - Simple computations
- Weaknesses
  - Computationally expensive
  - Determining the right data is difficult

Decision Trees

- What is a Decision Tree?
- Example
- Some Requirements
- Pruning Trees
- Evaluating Trees
- C4.5
- Variations of Decision Tree Methods
- Strengths/Weaknesses

What is a Decision Tree?

- It is used for classification
- The tree is drawn with the root at the top and the leaves at the bottom
- A record enters at the root, and a test is applied to decide which child path to take
  - Depends on the test that is applied
  - Choose the test that best discriminates among the target classes
- All records that end up at a given leaf are classified the same way
- Essentially, decision trees produce classifiers

Example

  Religion = X?
    Yes -> 20 < Age < 50?
             Yes -> Class A
             No  -> Class B
    No  -> Country = Y?
             Yes -> Class C
             No  -> Class D

  Class A: Highly suspicious
  Class B: Suspicious
  Class C: Mildly suspicious
  Class D: Non-suspicious

Example (Concluded)

- Classification
  - Example 1: Religion = X, Age = 25, Country = Y  =>  Class = Highly suspicious
  - Example 2: Religion = Z (not equal to X), Age = 28, Country = P (not equal to Y)  =>  Class = Non-suspicious

Some Requirements

Some requirements have to be met for decision trees to work:
- Attribute-value description
- Pre-defined classes
- Discrete classes
- Sufficient data
- Logical classification methods

Pruning Trees

- Essentially, discard one or more subtrees and replace them with leaves
- There are various ways to prune trees; examples include:
  - Decide not to divide a set of samples further under some condition, using a statistical test
  - Remove parts of the tree after it is constructed, using selected accuracy criteria; this is called post-pruning
    - The well-known C4.5 algorithm uses post-pruning
  - Identify candidate subtrees and then determine how to prune them

Evaluating Sub-Trees

- Candidate subtrees are selected for pruning
- Next, determine the subtrees that will best classify the data
  - Apply the error rate in selecting the best subtree
- The best subtree has to be at the right level: pruned enough, but not over-pruned
- The best subtree may not work in all cases; alternate subtrees may be needed
- Costs may also need to be taken into account; for example, what is the cost of misclassifying data?

C4.5

- C4.5 is Ross Quinlan's technique based on decision trees
- It is a refinement of the 1986 technique called ID3
- C4.5 is used in products such as Clementine
- Uses the post-pruning approach, but with a specific technique to estimate the predicted error rate
  - Pessimistic pruning
- The algorithm generates decision rules
  - E.g., if Attribute 1 = A and Attribute 2 = B, then Classification = Class C

Variation of the Decision Tree Method: Multiple Variables

- The most common decision tree methods are C4.5, CART, and CHAID
  - All three algorithms use a test on a single variable to decide each split
- In some cases we may need to use more than one variable for the test
  - E.g.
age and gender to make decisions
- Algorithms have been developed to use multiple attributes

Strengths/Weaknesses

- Strengths
  - Understandable rules
  - Classification without extensive computation
  - Indicates which fields are important for classification
- Weaknesses
  - Error-prone with many classes
  - Expensive to train

Neural Networks

- What is a Neural Network?
- Example
- Basic Process
- Types of Neural Networks
- Back Propagation Learning
- Key Issues in Learning
- Neural Networks for Time Series
- Strengths/Weaknesses

What is a Neural Network?

- Neural networks consist of basic units modeled on biological neurons
- Each unit combines its inputs into a single output
- Main questions:
  - What are the units, and how do they behave?
  - How are they connected together?
- Neural networks are trained using examples and are then run on data sets; based on the training they receive, they carry out their activities
- Used for both directed and undirected data mining

Example

[Figure: inputs such as living space, size of garage, and age of the house feed a network whose output is the appraised value]

Example (Continued)

Features describing a house:

Feature       | Description           | Values
# apts        | # dwelling units      | 1-3
Yr built      | Year built            | 1900-2000
Plumb. fix    | # plumbing fixtures   | 5-17
Heating type  | Heating system type   | coded as A or B
Base gar.     | Basement garage       | 0-2
Att. gar.     | Attached garage       | 0-228
Living        | Total living space    | 714-4185
Deck          | Deck space            | 0-738
Porch         | Porch space           | 0-452
Recroom       | Recreation space      | 0-672
Basement      | Basement space        | 9-810

Example (Continued)

Training set example:

Feature     | Value
Sales price | 171,000
Months ago  | 4
Num apts    | 1
Yr built    | 1923
Plumb. fix  | 9
Heat type   | A
Base gar.   | 0
Att. gar.   | 120
Living      | 1614
Deck        | 0
Porch       | 210
Recroom     | 0
Basement    | 175

Example (Continued)

Massaged training set example:

Feature     | Original Value | Range of Values  | Massaged Value
Sales price | 171,000        | 103,000-250,000  | 0.4626
Months ago  | 4              | 0-23             | 0.1739
Num apts    | 1              | 1-3              | 0.0000
Yr built    | 1923           | 1850-1986        | 0.5328
Plumb. fix  | 9              | 5-17             | 0.3333
Heat type   | A              | A or B           | 1.0000
Base gar.   | 0              | 0-2              | 0.0000
Att. gar.   | 120            | 0-228            | 0.5263
Living      | 1614           | 714-4185         | 0.2593
Deck        | 0              | 0-738            | 0.0000
Porch       | 210            | 0-452            | 0.4646
Recroom     | 0              | 0-672            | 0.0000
Basement    | 175            | 0-810            | 0.2160

Example (Concluded)

Computing the appraised value:
- Each input is scaled linearly into [0, 1]; e.g., for sales price, 103,000 maps to 0.0000 and 250,000 maps to 1.0000, so 171,000 maps to 0.4626
- Using its internal computation, the neural network computes the appraised value to be $213,250
  - The output of the neural network is between 0 and 1; to yield the appraised value of $213,250, the (massaged) output is 0.75
- After the neural network is trained with examples, it determines the formula and can then compute appraised values

Basic Process

- Inputs are given to the network
- A combination function combines the inputs into a single value, usually a weighted summation
- A transfer function calculates the output value from the result of the combination function
- The result is exactly one output, between 0 and 1

Types of Neural Networks

- Feed-forward neural network
  - Processing propagates from inputs to outputs without any loops
  - In a layered representation of a feed-forward network, there are no links between nodes within a layer
  - Hidden layers lie between the input and output layers
- Recurrent neural network
  - There are feedback links that form circular paths in the network
- The most widely used network is the feed-forward network with a back propagation learning algorithm

Back Propagation Learning

- Training the network means setting the best weights on the inputs of each unit
  - Use weights for which the output is close to the desired output
- Consists of the following steps:
  - The network gets a training example and, using the existing weights, calculates the output or outputs for that example
  - Back propagation calculates the error by comparing the calculated result with the actual result
  - The error is fed back through the network and the weights are
adjusted to minimize the error

Key Issues

- Choosing appropriate training sets
  - The training set needs to cover the full range of values
  - Number of inputs and number of outputs: need good coverage
  - Requires computational power
- Preparing the input data
  - Features with continuous and discrete values, as well as categorical values
- The results need interpretation

Neural Networks for Time Series Data

- Some data has a time-series character
  - E.g., stock market values fluctuate
  - Data mining techniques are needed to predict the next value in the time series
- Neural networks have been adapted to handle time-series data
  - The network is trained using the time series, starting with the oldest point, then the second oldest point, etc.
  - Feed-forward back propagation network
- May take multiple time series as inputs
  - E.g., Euro, Dollar, etc.
- Heuristics may be used for feed-forward back propagation networks

Strengths/Weaknesses

- Strengths
  - Wide range of problems
  - Good results in complicated domains
  - Handles discrete, continuous, and categorical values
  - Many packages available
- Weaknesses
  - Requires inputs in the range 0 to 1
  - Cannot explain its results
  - May produce a solution prematurely

Link Analysis

- What is Link Analysis?
- Example
- Graph Theoretic Techniques
- Some Problems in Graph Theory
- Examples
- Link Analysis and Visualization
- Strengths/Weaknesses

What is Link Analysis?

- Link analysis is based on graph theory
- It makes links between events, activities, places, and objects
  - E.g., linking physicians to pharmaceutical companies
- Association rules also find relationships; however, with graph-theoretic techniques one can find links not often found through other techniques
- Link analysis is not mature; there are few products
- There is much interest in using link analysis for counter-terrorism
  - Connecting the dots

Example

Consider the travel database given in Unit #10, viewed as a weighted graph: the nodes are the countries, and each edge weight is the number of trips containing the countries represented by the nodes at either end.

[Figure: weighted graph over England, France, Germany, Switzerland, and Austria, with edge weights drawn from the co-occurrence table, e.g., England-France = 2]

Example (Continued)

[Figure: five calls link seven phone numbers (703-3658, 781-5166, 703-4271, 703-3068, 703-3108, 781-6595, 202-1212); each edge is labeled with the call duration]

Example (Continued)

[Figure: call graph for 15 numbers and 19 calls, with edges labeled by call durations]

Example (Continued)

[Figure: initial call graph with some nodes labeled and short calls removed; labels are F (fax), I (information), and U (unknown)]

Example (Concluded)

[Figure: labels assigned to the remaining nodes]
- Nodes initially connected to a fax machine are labeled Fax (F)
- Nodes connected to information are labeled Voice (V)
- Nodes connected to both are labeled Shared (S)
- The rest are Unknown (U)

Graph Theoretic Techniques

- Graphs have nodes and edges
- The earliest problem in graph theory is the seven bridges of Konigsberg
- The various types of graphs include directed and undirected graphs
  - In directed graphs the edges are one-directional; in undirected graphs they are bidirectional

Some Problems in Graph Theory

- Detecting cycles in graphs
  - Whether a directed graph has cycles
  - If there are no cycles, it is an acyclic graph
- Pattern matching problems
  - Find recurrent
occurrences of subgraphs in a graph
- Graph reduction
  - Reduce graphs and eliminate unnecessary nodes

Link Analysis / Graph Reduction

[Figure: an original graph with nodes A through F reduced to a smaller graph by eliminating unnecessary nodes]

Connecting the Dots

Either a data miner automatically connects the nodes, or a federated administrator combines the graphs, carries out mining, and connects the nodes; the connections are shown by darkened lines.

Link Analysis and Visualization

- Link analysis is based on graphs, and graphs can be visualized
- With visualization one can detect patterns not usually detected in tables or other kinds of data
- Visualization can be used on the data, and the link analysis may then be performed on the concepts
- It works in both directions:
  - Visualization for link analysis
  - Link analysis on visual data

Strengths/Weaknesses

- Strengths
  - Takes advantage of relationships
  - Useful for visualization
- Weaknesses
  - Not applicable to many types of data
  - Few tools
  - May be difficult to implement

Cluster Detection

- What is Cluster Detection?
- Example
- Similarity Measures
- Hierarchical Clustering
- Partitional Clustering
- Incremental Clustering
- Evaluating Clusters
- Strengths/Weaknesses

What is Cluster Detection?
- Clustering explores the interrelationships between samples
- Example: customers in cluster 1 purchased cars costing more than 50K, while customers in cluster 2 purchased cars costing more than 100K
- Various representations:
  - Represent as clusters of points in an n-dimensional space
  - Represent graphically using nodes in a clustering tree
  - Use some logical expression on the sample attributes
- Usually the samples to be clustered are represented as vectors of measurements
  - Points in a multidimensional space

Example

[Figure: initial seeds determine the initial cluster boundaries]

Example (Continued)

[Figure: calculating the centroids of the new clusters]

Example (Concluded)

[Figure: at each iteration, all cluster assignments are re-evaluated]

Similarity Measures

- We need a measure of similarity between two patterns drawn from the same feature space
- This measure must be chosen carefully, because the quality of the clustering depends on it
- One usually computes the dissimilarity between two samples using a distance measure defined on the feature space
- The distance measure may be a metric or a quasi-metric, and is used to quantify the dissimilarity of samples

Hierarchical Clustering

- In hierarchical cluster analysis, the number of clusters is not specified as part of the input
  - That is, the input is (X, s), where X is a set of samples and s is a measure of similarity
- The output is a hierarchy of clusters
- Usually hierarchical clustering is not based on optimization, but on finding approximate suboptimal solutions
- Two types of algorithms:
  - Divisive: starts with the entire set of samples, divides it into subsets, and divides each subset further
  - Agglomerative: regards each object as an initial cluster; clusters are merged into coarser partitions until the trivial partition is obtained

Partitional Clustering

- A partitional clustering algorithm obtains a single partition of the data instead of the clustering structure (e.g., the dendrogram) produced by a hierarchical technique
- Partitional algorithms have advantages for large
data sets, for which constructing a dendrogram may be computationally complex
- Partitioning techniques produce clusters by optimizing a criterion function defined either locally (over a subset of the samples) or globally (over all of the samples)

Incremental Clustering

- The other types of clustering work in main memory
- There are applications where not all of the data can reside in main memory
- Incremental clustering techniques are needed to solve this problem
- Basic process:
  - Keep the data set in secondary memory and cluster subsets of the data independently, followed by a merge of the clusters (divide and conquer)
  - Transfer data items to main memory one at a time for clustering, storing only the cluster representations in main memory
  - A parallel implementation may be used

Evaluating Clusters

- We need clusters whose members have a high degree of similarity
- How do we determine whether a cluster is good, that is, whether its members are similar?
  - The standard measure of within-cluster similarity is the variance: the sum of the squared differences of each element from the mean
  - We need solutions that generate clusters with the lowest variance
  - Another measure: take whatever similarity measure or distance metric was used to form the clusters, and compare the average distance within clusters to the average distance between clusters

Strengths/Weaknesses

- Strengths
  - Undirected knowledge discovery
  - Works well with numeric, categorical, and textual data
- Weaknesses
  - Difficult to choose the right distance measures and weights
  - Results may be hard to interpret

Memory-Based Reasoning

- What is Memory-Based Reasoning (MBR)?
- Example
- Applying MBR
- Distance Function
- Combination Function
- Challenges and Directions
- Strengths/Weaknesses

What is Memory-Based Reasoning?
- Identify similar cases based on experience, and then apply the information from those cases to the problem at hand
- Two functions: a distance function assigns a distance between any two records; a combination function combines the results from the neighbors to arrive at an answer
- Used for:
  - Fraud detection: new cases of fraud are likely to be similar to known cases
  - Medical treatments: the most effective treatment is the one that produced the best outcomes for similar patients
  - Customer response prediction: the next customers to respond are likely to resemble previous customers who have responded

Example

[Figure: scatter graph of events attended (A, B, C, D) plotted by age and source]

Example (Concluded)

[Figure: to predict the last event attended by three unknowns, find where they land on the scatter graph and predict based on their nearest neighbors]

Applying MBR

- Basic idea:
  - Use the training set, and for every value in the set compute a point in space
  - When a new entry arrives, compute the point in space for the new entry
  - The new entry will have properties similar to those of its neighbors
- Steps:
  - Choosing the training set
  - Determining the distance function
  - Choosing the number of nearest neighbors
  - Determining the combination function

Distance Function

- Distance is the way MBR measures similarity
- The distance from A to B has four key properties:
  - Well defined:
The distance between two points is a non-negative real number
  - Identity: the distance from a point to itself is zero
  - Commutativity: the distance from A to B is the same as the distance from B to A
  - Triangle inequality: visiting an intermediate point C on the way from A to B never shortens the distance
- For MBR the points are records in a database; distance is the basis for measuring similarity
  - Every record has a neighbor somewhere in the database
  - Build a distance function one field at a time
  - Merge the distance functions for the different fields

Combination Function

- Sometimes the nearest-neighbor method alone may not work
- Example:
  - A may be close to B, but A may be a millionaire while B is a pauper
  - A tornado hits a region, but the sun shines 5 miles away
- Solution: a combination function
  - The k nearest neighbors determine the answer
  - Compute the distance function to each neighbor and select the k nearest neighbors
  - Use some combination function (e.g., majority vote) to determine the value for the new entry
  - E.g., is the person likely to leave the company?

Strengths/Weaknesses

- Strengths
  - Results are easily understood
  - Applicable to arbitrary data types, even nonrelational data
  - Works well with any number of fields
  - Minimal effort to maintain the training set
- Weaknesses
  - Computationally intensive
  - Storage required for the training set
  - Dependence on the distance function and combination function

Smart Card and Biometrics: Useful reference
http://www.ibia.org/membersadmin/whitepapers/pdf/17/Ensuring%20integrity%20with%20fingerprint%20verification.pdf
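The "build a distance function one field at a time, then merge" recipe described above can be sketched as follows. The record fields, the ranges, and the choice of summation as the merge are illustrative assumptions, not the text's:

```python
def field_distance(a, b, lo=None, hi=None):
    """Distance for a single field: range-scaled absolute difference
    for numeric values, 0/1 mismatch for categorical values."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        span = (hi - lo) if (lo is not None and hi is not None and hi > lo) else 1.0
        return abs(a - b) / span
    return 0.0 if a == b else 1.0

def record_distance(r1, r2, ranges):
    """Merge the per-field distances by summing them (one common
    choice; a Euclidean-style merge would also work)."""
    return sum(field_distance(r1[f], r2[f], *ranges.get(f, (None, None)))
               for f in r1)

# hypothetical customer records
ranges = {"age": (18, 80)}
a = {"age": 30, "state": "TX"}
b = {"age": 40, "state": "TX"}
c = {"age": 30, "state": "NY"}
```

Here record_distance(a, b, ranges) is about 0.16 while record_distance(a, c, ranges) is 1.0, so under this merge record a is closer to b than to c; the identity property holds as well, since a record's distance to itself is zero.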