An example of a sparse
0-1 dataset
Algorithmic aspects of data mining, fall 2005 Heikki Mannila
Example dataset
• http://fimi.cs.helsinki.fi/data/
• Retail: dataset was donated by Tom Brijs and
contains the (anonymized) retail market basket
data from an anonymous Belgian retail store.
• The data are provided ’as is’. Basically, any use
of the data is allowed as long as the proper
acknowledgment is provided and a copy of the
work is provided to Tom Brijs.
More details can be found here.
• http://fimi.cs.helsinki.fi/data/retail.pdf
Data format
head retail.dat
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
32 41 59 60 61 62
3 39 48
63 64 65 66 67 68
32 69
Each row lists the variables that are
set to 1 for that observation
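The row format above can be read into memory with a few lines of Python. This is a minimal sketch, not part of the original slides; the helper names `parse_transactions` and `to_dense` are hypothetical, and the sample rows are copied from the `head` output above.

```python
from typing import Iterable, List, Set


def parse_transactions(lines: Iterable[str]) -> List[Set[int]]:
    """Turn each whitespace-separated row into the set of variables equal to 1."""
    rows = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            rows.append({int(f) for f in fields})
    return rows


def to_dense(row: Set[int], n_vars: int) -> List[int]:
    """Expand one sparse row into an explicit 0-1 vector of length n_vars."""
    return [1 if j in row else 0 for j in range(n_vars)]


# Three rows taken from the head output shown on the slide.
sample = ["30 31 32", "33 34 35", "3 39 48"]
rows = parse_transactions(sample)
print(sorted(rows[2]))
print(to_dense(rows[0], 35)[28:])  # entries for variables 28..34
```

For the full file, something like `with open("retail.dat") as f: rows = parse_transactions(f)` should work, assuming the file has been downloaded into the working directory.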
Basic statistics
• wc retail.dat
88162 rows 908576 words
• tr ' ' '\n' < retail.dat | sort -n -r | head -1
16469
• About 88000 observations and 16500 variables
• On average, 10.3 variables are set to 1 per
observation
• About 0.06 % of the entries are 1; the rest are 0
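The averages above follow directly from the counts reported by the shell commands. The sketch below redoes the arithmetic; the function name `sparsity_summary` is hypothetical, and it assumes that variable ids start at 0 and that every id up to the maximum is a variable, so that the largest id 16469 implies 16470 columns.

```python
def sparsity_summary(n_rows: int, n_ones: int, max_id: int) -> dict:
    """Derive per-row average and density from the raw counts."""
    n_vars = max_id + 1  # assumption: ids run from 0 to max_id
    return {
        "variables": n_vars,
        "ones_per_row": n_ones / n_rows,
        "percent_ones": 100.0 * n_ones / (n_rows * n_vars),
    }


# Counts reported on the slide: 88162 rows, 908576 words, largest id 16469.
summary = sparsity_summary(n_rows=88162, n_ones=908576, max_id=16469)
print(round(summary["ones_per_row"], 1))
print(round(summary["percent_ones"], 2))
```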
What is there in the data?
• Descriptive statistics
• Association rules
• Decomposition approaches
– Principal components etc.
How often a variable is 1?
[Figure: Number of occurrences of variables. x-axis: Variable (0 to 18000); y-axis: Number of occurrences (0 to 6 × 10^4).]
How often a variable is 1?
[Figure: Number of occurrences of variables. x-axis: Variable, in frequency order (0 to 18000); y-axis: Number of occurrences (0 to 6 × 10^4).]
How often a variable is 1?
[Figure: Log of the number of occurrences of variables. x-axis: Variable, in frequency order (0 to 18000); y-axis: Log2(number of occurrences) (0 to 16).]
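The quantities plotted in these three figures could be computed roughly as follows. This is a sketch on a toy input, not the code used for the slides; `occurrence_counts` and `frequency_order` are hypothetical names.

```python
import math
from collections import Counter


def occurrence_counts(rows):
    """Count, for every variable, the number of rows in which it is 1."""
    counts = Counter()
    for row in rows:
        counts.update(set(row))  # set() guards against repeated ids in a row
    return counts


def frequency_order(counts):
    """Occurrence counts sorted from the most to the least frequent variable."""
    return sorted(counts.values(), reverse=True)


# Toy rows in the same format as retail.dat.
rows = [[38, 39, 47, 48], [38, 39, 48, 49], [32, 41], [3, 39, 48]]
counts = occurrence_counts(rows)
ordered = frequency_order(counts)
log2_ordered = [math.log2(c) for c in ordered]
print(ordered)
```

Plotting `ordered` against its index gives the frequency-order curve, and `log2_ordered` gives the logarithmic version.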
Number of ones per row
[Figure: Number of ones per row. Axes unlabeled; x runs from 0 to 9 × 10^4, y from 0 to 80.]
Number of ones per row
[Figure: Number of ones per row, sorted. Axes unlabeled; x runs from 0 to 9 × 10^4, y from 0 to 80.]
Number of rows with a given
number of ones
[Figure: Number of rows with a given number of ones. Axes unlabeled; x runs from 0 to 80, y from 0 to 8000.]
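The row-level summaries behind these plots are simple to derive. A minimal sketch on a few rows from the `head` output, with hypothetical helper names:

```python
from collections import Counter


def ones_per_row(rows):
    """Number of variables set to 1 in each row, in file order."""
    return [len(set(row)) for row in rows]


def rows_per_count(rows):
    """Map each row size k to the number of rows having exactly k ones."""
    return Counter(ones_per_row(rows))


rows = [[30, 31, 32], [33, 34, 35], [38, 39, 47, 48], [3, 39, 48], [32, 69]]
sizes = ones_per_row(rows)
sizes_sorted = sorted(sizes)
histogram = rows_per_count(rows)
print(sizes, sizes_sorted, dict(histogram))
```

Here `sizes` corresponds to the unsorted plot, `sizes_sorted` to the sorted one, and `histogram` to the number of rows with a given number of ones.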
Correlations
[Figure: Histogram of the correlations among the first 1000 variables. x-axis from −0.2 to 1.2; y-axis from 0 to 9 × 10^5.]
Correlations
[Figure: Histogram of the correlations among the first 1000 variables, zoomed in. x-axis from −0.06 to 0.12; y-axis from 0 to 8 × 10^5.]
Histogram of correlations
[Figure: Log(Histogram) of the correlations among the first 1000 variables. x-axis: Correlation (−0.2 to 1.2); y-axis: Log of count (0 to 20).]
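The slides do not say how the correlations were computed. One plausible reading is the ordinary Pearson correlation between 0-1 columns, which can be obtained from counts alone: for variables a and b with frequencies f(a), f(b), joint frequency f(a b) and n rows, it equals (n f(a b) − f(a) f(b)) / sqrt(f(a)(n − f(a)) f(b)(n − f(b))). The sketch below assumes that definition; `pairwise_correlations` is a hypothetical name.

```python
import math
from itertools import combinations


def pairwise_correlations(rows, variables):
    """Pearson correlation for every pair of 0-1 variables, from counts only."""
    n = len(rows)
    row_sets = [set(r) for r in rows]
    freq = {v: sum(v in r for r in row_sets) for v in variables}
    result = {}
    for a, b in combinations(variables, 2):
        fa, fb = freq[a], freq[b]
        denom = math.sqrt(fa * (n - fa) * fb * (n - fb))
        if denom == 0:
            continue  # a constant column has no defined correlation
        fab = sum(a in r and b in r for r in row_sets)
        result[(a, b)] = (n * fab - fa * fb) / denom
    return result


# Toy data: variables 1 and 2 always co-occur, variable 3 never joins them.
rows = [[1, 2], [1, 2], [3], []]
corr = pairwise_correlations(rows, [1, 2, 3])
print(corr)
```

For the first 1000 variables one would pass `range(1000)` as `variables`; the pure-Python loop is slow at that size, so a sparse-matrix library would likely be preferable in practice.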
Frequent sets and rules
from the retail data set
• Look at occurrence thresholds 5000, 2000, 1000, 500,
400, 300, 200, 100
• Rules with accuracy at least 0.9
• Bart Goethals's implementation
• http://www.adrem.ua.ac.be/~goethals/software/index.html
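To make the procedure concrete, here is a small levelwise (Apriori-style) search followed by rule generation. It is an illustrative sketch only and is not the implementation linked above; the function names are hypothetical, and rules are restricted to a single variable on the right-hand side, which is an assumption of this sketch.

```python
from itertools import combinations


def frequent_sets(rows, threshold):
    """Return {itemset: count} for all itemsets in at least `threshold` rows."""
    row_sets = [frozenset(r) for r in rows]
    counts = {}
    for r in row_sets:  # level 1: single variables
        for v in r:
            key = frozenset([v])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= threshold}
    result = dict(level)
    k = 2
    while level:
        # Candidates of size k: unions of two frequent (k-1)-sets whose
        # (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(level), 2):
            u = a | b
            if len(u) == k and all(
                frozenset(s) in level for s in combinations(u, k - 1)
            ):
                candidates.add(u)
        level = {}
        for c in candidates:
            n = sum(c <= r for r in row_sets)  # rows containing the candidate
            if n >= threshold:
                level[c] = n
        result.update(level)
        k += 1
    return result


def rules(freq, min_accuracy):
    """Rules X => y with accuracy f(X y) / f(X) of at least min_accuracy."""
    out = []
    for s, c in freq.items():
        if len(s) < 2:
            continue
        for y in s:
            lhs = s - {y}  # frequent, since it is a subset of a frequent set
            acc = c / freq[lhs]
            if acc >= min_accuracy:
                out.append((sorted(lhs), y, c, acc))
    return out


rows = [[1, 2, 3], [1, 2], [1, 2], [2, 3], [3]]
freq = frequent_sets(rows, threshold=2)
found = rules(freq, min_accuracy=0.9)
print(len(freq), found)
```

On the toy data the only rule that clears the 0.9 bar is 1 => 2, reported in the same (count, accuracy) form as on the slides.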
Rules with threshold 1000
• 36 => 38 (2790, 0.950273)
• 36 39 => 38 (1945, 0.954836)
• 36 39 48 => 38 (1080, 0.967742)
• 36 48 => 38 (1360, 0.960452)
• 37 => 38 (1046, 0.973929)
• 39 48 110 => 38 (1031, 0.994214)
• 39 48 170 => 38 (1193, 0.989221)
• 39 110 => 38 (1740, 0.989198)
• 39 170 => 38 (2019, 0.980573)
• 48 110 => 38 (1361, 0.986232)
• 48 170 => 38 (1538, 0.987797)
• 110 => 38 (2725, 0.975304)
• 170 => 38 (3031, 0.978057)
• 286 => 38 (1116, 0.943364)
For the last rule, 1116 = f(38 286) and 0.943364 = f(38 286) / f(286).
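The definition of accuracy can be checked against the numbers for the rule 286 => 38. The slides do not state f(286); the sketch below infers it by inverting the definition, so the resulting value is a derived figure rather than one taken from the source. The function names are hypothetical.

```python
def accuracy(f_both: int, f_lhs: int) -> float:
    """Accuracy of X => Y: fraction of rows containing X that also contain Y."""
    return f_both / f_lhs


def implied_lhs_frequency(f_both: int, acc: float) -> int:
    """Invert the definition: f(X) = f(X Y) / accuracy, as a whole row count."""
    return round(f_both / acc)


# Rule 286 => 38: count 1116, accuracy 0.943364 (both from the slide).
f_286 = implied_lhs_frequency(1116, 0.943364)  # inferred, not in the slides
print(f_286, round(accuracy(1116, f_286), 6))
```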
Interestingness of rules
• How interesting is this rule?
• f(38)=15596
• What would have been the expected accuracy of the
rule?
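One common baseline, assumed here rather than stated on the slide, is independence: if 38 occurred independently of the left-hand side, the expected accuracy of any rule X => 38 would simply be f(38) / n. The sketch below computes that baseline and the ratio of observed to expected accuracy (often called lift) for the rule 286 => 38, using n = 88162 rows from the basic statistics. The function names are hypothetical.

```python
def expected_accuracy(f_rhs: int, n_rows: int) -> float:
    """Accuracy expected under independence: the overall frequency of the RHS."""
    return f_rhs / n_rows


def lift(observed_accuracy: float, f_rhs: int, n_rows: int) -> float:
    """How many times more accurate the rule is than the independence baseline."""
    return observed_accuracy / expected_accuracy(f_rhs, n_rows)


n_rows = 88162       # rows in retail.dat
f_38 = 15596         # f(38), from the slide
observed = 0.943364  # accuracy of 286 => 38

baseline = expected_accuracy(f_38, n_rows)
print(round(baseline, 3), round(lift(observed, f_38, n_rows), 1))
```

Under this assumption the baseline is a little under 0.18, so the observed accuracy is several times higher than chance alone would give.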
Number of frequent sets
Threshold   Frequent sets   Rules with accuracy > 0.9
5000        16              0
2000        46              4
1000        136             14
500         469             32
400         700             44
300         1136            32
200         2192            100
100         6452            220