Download [30] Data preprocessing. (a) Suppose a group of 12 students with

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia, lookup

Nonlinear dimensionality reduction wikipedia, lookup

Transcript
1.
[30] Data preprocessing.
(a) Suppose a group of 12 students with test scores listed as follows:
19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95.
Partition these scores into four bins by
(1) equal-frequency (equi-depth) method,
(2) equal-width method,
(3) another (possibly better) method (such as clustering).
(b) What are the value ranges of the following normalization methods, respectively? (1) min-max
normalization, (2) z-score normalization, and (3) normalization by decimal scaling?
(c) The following table shows how many transactions containing beer and/or nuts among 10000
transactions. (1) calculate χ2, (2) calculate lift, (3) calculate all-confidence, and (4) based on your
calculation, how do you conclude the correlation between buying beer and buying nuts?
Nuts
No Nuts
Total
Beer
50
150
200
No Beer
800
9000
9800
Total
850
9150
10000
2.
[15] Data Warehousing and OLAP
Suppose a market shopping data warehouse consists of four dimensions: customer, date, product, and store,
and two measures: count, and avg sales, where count stores the number of transactions and avg sales stores
average sales in dollar.
(a) Draw a star (or snowflake) schema diagram (sketch it, make your assumptions on higher levels
of a dimension, for example, week, month, year for date).
(b) Starting with the base cuboid [customer, date, product, store], what specific OLAP operations
(e.g., roll-up student to department (level)) that one should perform in order to list the average
sales of each cosmetic product since January 2012 ?
(c) If each dimension has 5 levels (excluding all ), such as store-city-state-region-country, how many
cuboids does this cube contain (including base and apex cuboids)?
3.
[25] Data cube and data generalization
(a) Assume a base cuboid of N dimensions contains only p (where p > 3) nonempty base cells, and
there is no level associated with any dimension. Explain (1) what is the maximum number of
nonempty cells (including the cell in the base cuboid) possible in such a materialized data cube?
(2) if the minimum support (i.e., iceberg condition) is 3, what is the minimum number of nonempty
cells possible in the materialized iceberg cube?
(b) Among the following four methods: multiway array computation, BUC (bottom-up computation),
StarCubing, and shell-fragment approaches, which one is the best feasible choice in each of the
following cases?
(1) computing a dense full cube of low dimensionality (e.g., less than 8 dimensions),
(2) computing a large iceberg cube of around 10 dimensions, and
(3) computing a sparse iceberg cube of high dimensionality (e.g., over 100 dimensions).
(c) Explain why the data cube could be quite sparse, and how one should implement such a sparse
cube if one adopts an array implementation of data cube.
4.
[30] Frequent pattern mining.
(a) The following table shows a database of five transactions, where each transaction gives
Transaction ID (TID) and the items bought. Let min sup = 60% and min conf = 80%.
TID Items bought
100 I, P, A, D, B, C
200 D, A, E, F
300 C, D, B, E
400 B, A, C, K, D
500 A, G, T, C
(1) Show how the Apriori algorithm works step-by-step to discover frequent itemsets
(2) List all discovered association rules (with support and confidence)
(b) Suppose the user is only interested in frequent itemsets with one of the following constraints. State
the characteristics of each constraint and how to use the constraint to mine patterns efficiently.
(1) The sum of the price of all the items (in each pattern) is between $100 and $200
(2) Average price of all the items in each pattern is more than $20
(3) The price difference between any two items in each pattern is beyond $10.
(c) Briefly describe an efficient incremental frequent-itemset mining method which can incorporate
newly added transactions without redoing the mining from scratch. Formally, let F be a set of
frequent itemsets discovered from a data set D, let d be a set of newly added transactions, |d| <<
|D|. Can you mine frequent itemsets in D+d from F and d only (without scanning D)? If not
possible, can you mine frequent itemsets in D+d with less number of scans of D than Apriori?
.
------ This is the end of questions ------