Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An example of a sparse 0-1 dataset Algorithmic aspects of data mining, fall 2005 Heikki Mannila Example dataset • http://fimi.cs.helsinki.fi/data/ • Retail: dataset was donated by Tom Brijs and contains the (anonymized) retail market basket data from an anonymous Belgian retail store. • The data are provided ’as is’. Basically, any use of the data is allowed as long as the proper acknowledgment is provided and a copy of the work is provided to Tom Brijs. More details can be found here. • http://fimi.cs.helsinki.fi/data/retail.pdf Algorithmic aspects of data mining, fall 2005 Heikki Mannila 1 Data format head retail.dat 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 38 39 47 48 38 39 48 49 50 51 52 53 54 55 56 57 58 32 41 59 60 61 62 3 39 48 63 64 65 66 67 68 32 69 Each row lists the variables that are set to 1 for that observation Algorithmic aspects of data mining, fall 2005 Heikki Mannila Basic statistics • wc retail.dat 88162 rows 908576 words • tr ''' \n' < retail.dat | sort -n -r | head -1 16469 • 88000 observations, 16500 variables • On the average 10.3 variables set to 1 per observation • 0.06 % of the entries are 1, the rest are 0 Algorithmic aspects of data mining, fall 2005 Heikki Mannila 2 What is there in the data? • Descriptive statistics • Association rules • Decomposition approaches – Principal components etc. Algorithmic aspects of data mining, fall 2005 Heikki Mannila How often a variable is 1? Number of occurrences of variables 4 6 x 10 Number of occurrences 5 4 3 2 1 0 0 2000 4000 6000 8000 10000 12000 Variable Algorithmic aspects of data mining, fall 2005 Heikki Mannila 14000 16000 18000 3 How often a variable is 1? Number of occurrences of variables 4 6 x 10 Number of occurrences 5 4 3 2 1 0 0 2000 4000 6000 8000 10000 12000 Variable, frequency order Algorithmic aspects of data mining, fall 2005inHeikki Mannila 14000 16000 18000 How often a variable is 1? Log of the number of occurrences of variables 16 14 Log2(number of occurrences) 12 10 8 6 4 2 0 0 2000 4000 6000 8000 10000 12000 frequency order Algorithmic aspects of data mining, fallVariable, 2005 in Heikki Mannila 14000 16000 18000 4 Number of ones per row Number ones per row 80 70 60 50 40 30 20 10 0 0 1 2 3 4 5 6 7 8 9 4 x 10 Algorithmic aspects of data mining, fall 2005 Heikki Mannila Number of ones per row Number ones per row, sorted 80 70 60 50 40 30 20 10 0 0 1 2 3 4 5 6 7 8 9 4 x 10 Algorithmic aspects of data mining, fall 2005 Heikki Mannila 5 Number of rows with a given number of ones Number of rows with a given number of ones 8000 7000 6000 5000 4000 3000 2000 1000 0 0 10 20 30 40 50 60 Algorithmic aspects of data mining, fall 2005 Heikki Mannila 70 80 Correlations Histogram of the correlations among the first 1000 variables 5 9 x 10 8 7 6 5 4 3 2 1 0 −0.2 0 0.2 0.4 0.6 0.8 1 1.2 Algorithmic aspects of data mining, fall 2005 Heikki Mannila 6 Correlations Histogram of the correlations among the first 1000 variables 5 x 10 8 7 6 5 4 3 2 1 0 −0.06 −0.04 −0.02 0 0.02 0.04 0.06 0.08 0.1 0.12 Algorithmic aspects of data mining, fall 2005 Heikki Mannila Histogram of correlations Log(Histogram) of the correlations among the first 1000 variables 20 18 16 14 Log of count 12 10 8 6 4 2 0 −0.2 0 0.2 0.4 0.6 Correlation Algorithmic aspects of data mining, fall 2005 Heikki Mannila 0.8 1 1.2 7 Frequent sets and rules from the retail data set • Look at occurrence thresholds 5000, 2000, 1000, 500, 400, 300, 200, 100 • Rules with accuracy at least 0.9 • Bart Goethal’s implementation • http://www.adrem.ua.ac.be/~goethals/software/index.html Algorithmic aspects of data mining, fall 2005 Heikki Mannila Rules with threshold 1000 • • • • • • • • • • • • • • 36 => 38 36 39 => 38 36 39 48 => 38 36 48 => 38 37 => 38 39 48 110 => 38 39 48 170 => 38 39 110 => 38 39 170 => 38 48 110 => 38 48 170 => 38 110 => 38 170 => 38 286 => 38 (2790, 0.950273) (1945, 0.954836) (1080, 0.967742) (1360, 0.960452) (1046, 0.973929) (1031, 0.994214) (1193, 0.989221) (1740, 0.989198) (2019, 0.980573) (1361, 0.986232) (1538, 0.987797) (2725, 0.975304) (3031, 0.978057) (1116, 0.943364) Algorithmic aspects of data mining, fall 2005 Heikki Mannila = f(38 286)/ f(286) = f(38 286) 8 Interestingness of rules • How interesting is this rule? • f(38)=15596 • What would have been the expected accuracy of the rule? Algorithmic aspects of data mining, fall 2005 Heikki Mannila Number of frequent sets Threshold Frequent sets Rules with accuracy > 0.9 5000 16 0 2000 46 4 1000 136 14 500 469 32 400 700 44 300 1136 32 200 2192 100 100 6452 220 Algorithmic aspects of data mining, fall 2005 Heikki Mannila 9