ACCTG 6910, Spring 2003
DESB, University of Utah
Final Exam (8 – 10 AM, May 1, 2003)
Closed books and notes
Question 1 Data mining overview and applications
a) Describe the steps and their purposes in knowledge discovery from databases. (10 points)
Step 1: Selection: select the columns (attributes) and rows (records) of interest to be mined.
Step 2: Cleaning: remove errors from the selected data.
Step 3: Transformation: transform the cleaned data into a form suitable for high-performance data mining.
Step 4: Data mining: mine the transformed data to obtain patterns.
Step 5: Interpretation and evaluation: filter out uninteresting patterns from the data mining results.
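As an illustrative sketch only (not part of the original answer), the five steps might look as follows in Python with pandas and scikit-learn; the tiny customer table, column names, and the choice of k-means are assumptions made for the example.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Step 1: Selection -- choose the attributes and records of interest
# (a made-up customer table stands in for a real database extract)
df = pd.DataFrame({
    "active":    [1, 1, 1, 1, 0, 1],
    "age":       [25, 40, 38, 150, 30, None],
    "income":    [30000, 80000, 75000, 50000, 45000, 60000],
    "purchases": [2, 10, 9, 4, 3, 5],
})
data = df.loc[df["active"] == 1, ["age", "income", "purchases"]]

# Step 2: Cleaning -- drop records with missing or clearly invalid values
data = data.dropna()
data = data[data["age"].between(18, 100)]

# Step 3: Transformation -- rescale attributes for the mining algorithm
X = MinMaxScaler().fit_transform(data)

# Step 4: Data mining -- obtain patterns (here, cluster assignments)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 5: Interpretation and evaluation -- profile the clusters, keep the interesting ones
data["cluster"] = model.labels_
print(data.groupby("cluster").mean())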
b) Effective data mining needs expert knowledge about the relevant business process and knowledge about the data attributes. Identify which activity requires more such expert knowledge in each of the following comparisons. Justify each answer. (10 points)
i. Configuring a clustering task versus configuring a classification task
The classification task requires more expert knowledge, since the expert must supply training and test datasets containing both input attributes and a class label. These choices affect the accuracy of the classification findings and the computational requirements. That is, classification uses supervised learning to build the classification model, and expert knowledge is required to identify and verify the appropriate class label during the training and testing phases.
A clustering task, however, needs only input attributes; the expert only needs to provide an appropriate training dataset containing input attributes to perform the task. That is, clustering is a kind of unsupervised learning, so less expert knowledge is required.
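As a small sketch of this supervised/unsupervised distinction (the toy data below is made up for illustration): a classifier needs an expert-provided class label y in addition to the input attributes X, while a clustering algorithm needs only X.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy dataset: six records with two input attributes (values are illustrative only)
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
              [5.0, 6.0], [5.5, 5.8], [4.8, 6.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # class labels that an expert must supply

# Classification (supervised): requires both X and y
clf = DecisionTreeClassifier().fit(X, y)

# Clustering (unsupervised): requires only the input attributes X
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(clf.predict([[1.1, 2.1]]), km.labels_)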
ii. Deciding on actions to be taken based on clusters discovered from a clustering task versus deciding on actions to be taken based on a decision tree
Following the rules derived from a decision tree, end users can easily obtain the predicted class label for each new record and take actions accordingly; little expert knowledge is required during this process.
Taking actions based on clusters obtained from clustering, however, is more challenging. Expert knowledge, e.g., statistical knowledge, is required to interpret the clustering results and the distribution of the input attributes' values. Moreover, the explanation and application of clusters depend on the specific business problem, so knowledge from a domain expert is often needed.
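To illustrate why acting on a decision tree is straightforward, the following sketch (using scikit-learn's bundled iris data purely as an example) prints the tree as a set of readable if/then rules that an end user can follow directly.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree and print its rules; each root-to-leaf path is an if/then rule
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))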
c) Give a business intelligence application that benefits from data mining solutions
instead of OLAP queries. Be specific with the data mining method you
recommend and justify your answer. (10 points)
d) An Internet marketer is interested in segmenting Internet users with a clustering tool using the input attributes: top ten search keywords used, top ten URLs visited, ten most recent online purchases (vendor, product, quantity, amount), Internet usage level, heaviest access hour, and heaviest access day of the week. Answer the following questions:
i. Can we find users with different income levels? Why or why not? (5 points)
We cannot find users with different income levels because no income information is provided in the scenario above. Without income level provided as an input attribute, it is impossible for us to find clusters of users with different income levels.
ii. Can we expect to find clusters differentiated based on Internet usage level? Why or why not? (5 points)
Since Internet usage level is provided as an input attribute, we can expect to find clusters differentiated based on it. The clustering method will use Internet usage level as one component of its distance measure, grouping similar users together and separating dissimilar users.
Question 2 Association Rules
a) If {1, 2, 3} and {2, 3, 4} are the only large 3-itemsets, identify for each of the following sets whether it is a large itemset, is not a large itemset, or whether you cannot be certain. (10 points)
i. {1}: Yes
ii. {1, 2}: Yes
iii. {1, 4}: Cannot be certain. {1, 4} is not a subset of either large 3-itemset, so the Apriori property neither guarantees that it is large nor rules it out.
iv. {1, 2, 3, 4}: No
v. {1, 3, 4}: No
b) Name and describe the property used to determine the answers in a). (5 points)
Apriori property: any non-empty subset of a large itemset must be large.
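A brief sketch (not from the exam) of how this property is used to prune candidates in the Apriori algorithm: a candidate k-itemset is kept only if every one of its (k-1)-subsets is already known to be large.

from itertools import combinations

def prune_candidates(candidates, large_prev):
    # Keep a candidate k-itemset only if all of its (k-1)-subsets are large
    large_prev = set(map(frozenset, large_prev))
    return [c for c in candidates
            if all(frozenset(s) in large_prev for s in combinations(c, len(c) - 1))]

# With {1,2,3} and {2,3,4} as the only large 3-itemsets, the candidate 4-itemset
# {1,2,3,4} is pruned because {1,2,4} and {1,3,4} are not large.
print(prune_candidates([{1, 2, 3, 4}], [{1, 2, 3}, {2, 3, 4}]))  # -> []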
c) Assume that the confidence of the association rule 1 -> 2 is 100%. Is the confidence of the association rule 2 -> 1 also 100%? Give an example of data to justify your answer. (3 points)
Not necessarily. For example:

Transaction    Items
1              1, 2
2              1, 2, 3
3              2, 3
4              1, 2, 4
5              2, 3, 4

The confidence of 1 -> 2 is 3/3 = 100%, while the confidence of 2 -> 1 is only 3/5 = 60%.
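The asymmetry can be verified with a few lines of Python over the example transactions (a sketch; sets are used to represent the itemsets).

transactions = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 2, 4}, {2, 3, 4}]

def confidence(antecedent, consequent, db):
    # conf(A -> B) = support(A union B) / support(A)
    has_a = [t for t in db if antecedent <= t]
    has_both = [t for t in has_a if consequent <= t]
    return len(has_both) / len(has_a)

print(confidence({1}, {2}, transactions))  # 1.0, i.e., 100%
print(confidence({2}, {1}, transactions))  # 0.6, i.e., 60%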
d) Assume that the numerals in the following association rules and large sequences
identify different music files that customers downloaded on the Internet in the
same sessions or over multiple sessions. As a consultant to Amazon.com, make a
recommendation to your client based on each of the following association rules.
(12 points)
i. 1 -> 2 with low support, high confidence, and lift = n where n is large.
Because of the large lift, files 1 and 2 are positively correlated even though the support is low. If a customer downloads file 1, Amazon.com can recommend that the customer also download file 2 in the same session.
ii. 1 -> 2 with high support, high confidence, and lift = 0.
This rule could be misleading since lift < 1: files 1 and 2 are negatively correlated. Therefore, it is not reasonable to suggest that a customer download file 2 when they download file 1 in a session. However, because of the high support, it is still reasonable to recommend files 1 and 2 for download together.
iii. 1 -> 2 with high support, high confidence, and lift = -n where n is large.
This rule could be misleading since lift < 1: files 1 and 2 are negatively correlated. Therefore, it is not reasonable to suggest that a customer download file 2 when they download file 1 in a session, although, because of the high support, it is still reasonable to recommend files 1 and 2 for download together.
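For reference, lift can be computed alongside support and confidence; the sketch below (illustrative only) reuses the transactions from part c). A lift above 1 indicates positive correlation, below 1 negative correlation, and exactly 1 independence.

transactions = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 2, 4}, {2, 3, 4}]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

def lift(antecedent, consequent, db):
    # lift(A -> B) = support(A union B) / (support(A) * support(B))
    return support(antecedent | consequent, db) / (
        support(antecedent, db) * support(consequent, db))

print(lift({1}, {2}, transactions))  # 1.0 -> independent
print(lift({1}, {3}, transactions))  # about 0.56 -> negatively correlated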
iv. <{1, 2}, {3}> with high support
If a customer downloaded files 1 and 2 in one of their previous sessions, Amazon.com can recommend that the customer download file 3 in the next session and pre-fetch file 3.
v. <{1, 2, 3}, {4}> with high support
If a customer downloaded files 1, 2, and 3 in one of their previous sessions, Amazon.com can recommend that the customer download file 4 in the following sessions and pre-fetch file 4.
Question 3 Clustering and Classification/Prediction
a) Compare two definitions (views) of prediction tasks. (5 points)
View 1
  a. Classification: discovery
  b. Prediction: predictive, utilizing classification results (rules)
View 2
  a. Either discovery or predictive
  b. Classification: categorical or ordinal class labels
  c. Prediction: numerical (continuous) class labels
b) Compare the pros and cons of decision tree and neural network classification methods. (7 points)
Decision trees
  a. Pros:
     i. Clear rules
     ii. Fast algorithm
     iii. Scalable
  b. Cons:
     i. Accuracy may suffer with complex problems, e.g., a large number of class labels
Neural networks
  a. Pros:
     i. Very powerful (can approximate essentially any function)
  b. Cons:
     i. Time-consuming to train
     ii. Black box (see the sketch below)
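To make the black-box point concrete, the sketch below (the iris data and the network size are arbitrary choices for illustration) shows that a trained neural network stores its knowledge as weight matrices rather than readable rules.

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

# Fit a small neural network; its learned "knowledge" is a set of weight matrices,
# which an end user cannot easily interpret
iris = load_iris()
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=0).fit(iris.data, iris.target)
print([w.shape for w in mlp.coefs_])  # e.g. [(4, 10), (10, 3)] -- weights, not rules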
c) Given the following pairs of credit ranking and fraud outcome in a data set, give the formulation for entropy reduction and fill in the weights and probabilities in the formulation for splitting the records by Ranking = L. (5 points)

Record    1    2    3    4    5    6    7    8    9    10
Fraud     No   No   No   Yes  No   No   No   Yes  No   Yes
Ranking   L    L    M    H    H    L    L    H    M    M
entropy reduction = E - E'
E  = -(3/10 * log2(3/10) + 7/10 * log2(7/10))
   = -(0.3 * log2(0.3) + 0.7 * log2(0.7))
E' = w1 * E1 + w2 * E2
   = 4/10 * E1 + 6/10 * E2
   = 0.4 * E1 + 0.6 * E2
E1 = -(4/4 * log2(4/4) + 0/4 * log2(0/4))
   = -(1 * log2(1) + 0 * log2(0))
E2 = -(3/6 * log2(3/6) + 3/6 * log2(3/6))
   = -(0.5 * log2(0.5) + 0.5 * log2(0.5))
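A quick numeric check of this formulation (a sketch; 0 * log2(0) is taken as 0 by convention):

from math import log2

def entropy(labels):
    # Entropy of a list of class labels; 0 * log2(0) is treated as 0
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

fraud   = ["No", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "Yes"]
ranking = ["L", "L", "M", "H", "H", "L", "L", "H", "M", "M"]

left  = [f for f, r in zip(fraud, ranking) if r == "L"]   # 4 records, all No
right = [f for f, r in zip(fraud, ranking) if r != "L"]   # 6 records, 3 Yes / 3 No

E = entropy(fraud)                                    # about 0.881
E_split = 0.4 * entropy(left) + 0.6 * entropy(right)  # 0.4*0 + 0.6*1 = 0.6
print(E - E_split)                                    # entropy reduction, about 0.281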
d) Without calculating the value of any entropy reduction, give an intuitive explanation of why additional attributes should be included to increase the classification accuracy in c). (5 points)
When the ranking is H or M, we cannot cleanly separate the fraud records from the non-fraud records without additional attributes.
e) Describe two distance normalization and standardization methods. (8 points)
Normalization sets the minimum and maximum values of each input attribute to be the same. For every pair of objects:
normalized distance = (actual distance on input attribute i between the two objects) * (maximum normalized distance - minimum normalized distance) / (maximum possible distance - minimum possible distance)
Standardization follows the steps below to obtain the standardized distance:
  - Calculate the mean value
  - Calculate the mean absolute deviation
  - Standardize each variable value as:
    standardized value = (original value - mean value) / mean absolute deviation
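A short sketch of both methods for a single numeric attribute (the values below are made up):

values = [2.0, 4.0, 4.0, 6.0, 9.0]

# Min-max normalization: rescale every value into the same [0, 1] range
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization using the mean absolute deviation
mean = sum(values) / len(values)
mad = sum(abs(v - mean) for v in values) / len(values)
standardized = [(v - mean) / mad for v in values]

print(normalized)    # [0.0, 0.2857..., 0.2857..., 0.5714..., 1.0]
print(standardized)  # [-1.5, -0.5, -0.5, 0.5, 2.0]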