IT 241
Information Discovery and Architecture Exam 3
Page 1
December 1, 2011
Name _____________________________
This exam is open book and on-line but no contact with a live person. Be careful with your time!
1. In what ways are data mining and data visualization connected? How does one support the other?
[10 pts]
Both reveal/confirm relationships of attributes in data; both allow ways to drill into data
Data mining can determine attributes to visualize; the output of the data mining may be visualized
Visualization may allow you to see patterns and confirm the patterns found by data mining, such as those in correlation matrices
2. Describe briefly the process(es) of data warehousing used in preparation for data mining. Do not just narrate the data
warehousing architecture diagram.
[5 pts]
Identify data to be archived (not needed for day-to-day operations).
Clean data, determine attributes not needed for further data mining. Some aggregation/counts may be generated.
Denormalize into cube or star schema
3. Give an example of shallow knowledge as found in the national census data. Similarly give an example of hidden
knowledge as might be found in the national census data.
[5 pts]
Shallow – any simple query, counts, sums. Just return facts.
Deep/hidden – any data mining example, rules, cluster, regression.
4. Data types in data mining.
[15 pts]
a. Give an example of nominal data that cannot be converted to ordinal data.
Favorite color
b. Give an example of ordinal data that may be limited in conversion to numeric data.
Interest levels (likert, agree/disagree)
c. Give an example of a numeric data set that can be mapped to nominal data. Explain the mapping.
Hat sizes: numeric sizes map to the nominal labels small, medium, large
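The mapping in 4c can be sketched as simple binning. This is only an illustration: the function name `hat_size_label` and both cut points are hypothetical and do not come from the exam.

```python
def hat_size_label(size_inches, small_max=21.5, large_min=23.0):
    """Map a numeric hat size to a nominal label.

    The cut points (small_max, large_min) are hypothetical; any agreed
    thresholds would serve the same purpose.
    """
    if size_inches < small_max:
        return "small"
    if size_inches < large_min:
        return "medium"
    return "large"


# Three example sizes, one falling in each bin.
labels = [hat_size_label(s) for s in (21.0, 22.5, 23.5)]
print(labels)
```

Note that the mapping loses information: two different numeric sizes in the same bin become indistinguishable.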
5. In the data preprocessing of the rainfall.txt file for cities in the US, describe 4 problems found in the data set and what
resolutions you took for those problems, before exporting it into CSV format.
[15 pts]
a. Duplicate data
b. Zeroes or missing data
c. Outliers, regarding city names
d. Columns may be irrelevant
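Two of the problems listed above (duplicates and missing values) could be handled along these lines before exporting to CSV. This is a sketch under assumptions: the layout of rainfall.txt is not given here, so the rows, the column names, and the choice to drop incomplete rows are all hypothetical.

```python
import csv
import io

# Hypothetical rows in the form (city, rainfall); not taken from rainfall.txt.
raw_rows = [
    ("CityA", "10.0"),
    ("CityA", "10.0"),  # exact duplicate of the first row
    ("CityB", ""),      # missing rainfall value
    ("CityC", "30.0"),
]


def clean(rows):
    """Drop exact duplicates and rows with any empty field, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if row in seen:
            continue  # duplicate data
        seen.add(row)
        if any(field.strip() == "" for field in row):
            continue  # missing data (one possible resolution: discard the row)
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(raw_rows)

# Export to CSV; an in-memory buffer stands in for a real output file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["city", "rainfall"])
writer.writerows(cleaned_rows)
csv_text = buffer.getvalue()
```

Discarding incomplete rows is only one resolution; replacing the missing value with an average is another, depending on how much data can be spared.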
6. Describe how you might determine which attributes you could eliminate from a data set with dozens of attributes?
[5 pts]
matrix of scatter plots
correlation matrix
attribute selection algorithms
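As an illustration of the correlation matrix approach, the sketch below flags an attribute as redundant when it is almost perfectly correlated with an attribute already kept. The function names, the 0.95 threshold, and the sample columns are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a zero variance would divide by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_attributes(columns, threshold=0.95):
    """Return attributes whose |correlation| with an earlier kept one >= threshold."""
    names = list(columns)
    drop = set()
    for i, a in enumerate(names):
        if a in drop:
            continue
        for b in names[i + 1:]:
            if b not in drop and abs(pearson(columns[a], columns[b])) >= threshold:
                drop.add(b)
    return sorted(drop)


# Hypothetical numeric attributes; "x_doubled" carries no new information.
columns = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "other": [5.0, 1.0, 4.0, 2.0, 3.0],
}
to_drop = redundant_attributes(columns)
```

A high correlation between two input attributes suggests one of them can be eliminated; it does not by itself say which attributes predict the target.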
7. Decision trees.
[15 pts]
a. Given the decision tree rule for the above dataset
IF Sex=Male && WatchPromo=No
THEN CreditCardInsurance=No
Determine its accuracy = __3 out of 4, or 75__% and its coverage = __4 out of 15, or 27__%
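The two measures can be computed mechanically: coverage is the fraction of all instances that match the IF part, and accuracy is the fraction of those matches for which the THEN part also holds. The sketch below uses a small made-up record list, not the exam's dataset, and the function name `rule_stats` is an assumption.

```python
def rule_stats(records, antecedent, consequent):
    """Return (accuracy, coverage) of the rule IF antecedent THEN consequent."""
    matched = [r for r in records
               if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in matched
               if all(r[k] == v for k, v in consequent.items())]
    coverage = len(matched) / len(records)
    accuracy = len(correct) / len(matched) if matched else 0.0
    return accuracy, coverage


# Hypothetical records for illustration only (not the exam's 15 instances).
records = [
    {"Sex": "Male", "WatchPromo": "No", "CreditCardInsurance": "No"},
    {"Sex": "Male", "WatchPromo": "No", "CreditCardInsurance": "No"},
    {"Sex": "Male", "WatchPromo": "No", "CreditCardInsurance": "Yes"},
    {"Sex": "Male", "WatchPromo": "Yes", "CreditCardInsurance": "Yes"},
    {"Sex": "Female", "WatchPromo": "No", "CreditCardInsurance": "No"},
    {"Sex": "Female", "WatchPromo": "Yes", "CreditCardInsurance": "No"},
]

accuracy, coverage = rule_stats(
    records,
    antecedent={"Sex": "Male", "WatchPromo": "No"},
    consequent={"CreditCardInsurance": "No"},
)
# Here 3 of 6 records match the IF part, and 2 of those 3 satisfy the THEN part.
```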
b. Draw a decision tree to correspond
with these three production rules.
(Not all leaves are defined.)
IF Sex=Female
THEN CreditCardInsurance=No
IF Sex=Male && WatchPromo=No
THEN CreditCardInsurance=No
IF Sex=Male && IncomeRange =30-40K
THEN CreditCardInsurance=Yes
              Sex
       male /     \ female
           /       \
    WatchPromo     CCI=no
    no /    \ yes
      /      \
  CCI=no    IncomeRange
            / 30-40K
        CCI=yes
c. In predicting CreditCardInsurance, why is the entropy = 0 bits for Salary="50-60K"?
both instances map to ‘no’
d. In predicting CreditCardInsurance, the entropy for Salary="30-40K" is expressed as
info([ _3_ , _2_ ]) = entropy( _3_/_5_ , _2_/_5_ ). There are 3 different numbers in these 6 blanks.
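As a numeric check on parts c and d, the entropy of a class distribution can be computed from the class counts. The helper name `info` mirrors the notation above; the code itself is a minimal sketch, not part of the exam.

```python
from math import log2


def info(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Zero counts are skipped: they contribute nothing to the entropy.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


entropy_30_40k = info([3, 2])  # entropy(3/5, 2/5), about 0.971 bits
entropy_50_60k = info([2, 0])  # both instances in one class, so 0 bits
```

When every instance in a branch maps to the same class, the single remaining term is 1 × log2(1) = 0, which is why the 50-60K branch has zero entropy.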
8. Association Rules.
[15 pts]
a. Using the credit card data from the previous page, identify 5 single-item sets that would be generated with a
coverage threshold of 33%. (Exclude the age attribute.)
Single-item set           Number of items
A. Income = 30-40K        5
B. Mag Promo = Yes        8
C. Mag Promo = No         7
D. Watch Promo = Yes      8
E. Watch Promo = No       7
   LifeIns = Yes          9
b. What pairings of your 5 item sets A-E, if any, also meet the 33% coverage threshold?
A & B: 5
c. If you had the pairing (which you may not necessarily have) of MagazinePromo=Yes and
LifeInsPromo=Yes, what two rules could be expressed? And then, calculate their coverage as a ratio.
i. IF __MagPromo=yes__ THEN __LifeInsPromo=yes__ (_6_ / _8_)
ii. IF __LifeInsPromo=yes__ THEN __MagPromo=yes__ (_6_ / _9_)
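The arithmetic behind parts a and c can be sketched as follows. The item counts and the pair count of 6 are taken from the answers above, and the total of 15 instances comes from the "4 out of 15" coverage in question 7; the variable and function names are assumptions. The ratio in part c divides the pair count by the count of the IF item, which is the quantity often called confidence.

```python
from math import ceil

TOTAL_INSTANCES = 15       # from "4 out of 15" in question 7
COVERAGE_THRESHOLD = 0.33  # threshold stated in part a

# Minimum number of instances an item set needs to meet the threshold.
min_count = ceil(COVERAGE_THRESHOLD * TOTAL_INSTANCES)

# Single-item counts as listed in the answer to part a.
item_counts = {
    "Income=30-40K": 5,
    "MagPromo=Yes": 8,
    "MagPromo=No": 7,
    "WatchPromo=Yes": 8,
    "WatchPromo=No": 7,
    "LifeInsPromo=Yes": 9,
}
frequent_items = [item for item, n in item_counts.items() if n >= min_count]


def rule_ratio(pair_count, antecedent_count):
    """Fraction of instances matching the IF item that also match the THEN item."""
    return pair_count / antecedent_count


PAIR_COUNT = 6  # instances with MagPromo=Yes and LifeInsPromo=Yes (part c)
mag_to_life = rule_ratio(PAIR_COUNT, item_counts["MagPromo=Yes"])      # 6 / 8
life_to_mag = rule_ratio(PAIR_COUNT, item_counts["LifeInsPromo=Yes"])  # 6 / 9
```

The same pair of items yields two rules with different ratios because the denominators differ (8 versus 9).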
9. K-means algorithm.
[8 pts]
a. If you were to apply the cluster K-means algorithm to the credit card data set, what might you first do with the
salary and sex attributes?
convert to numeric
b. Why would you, or not, include the creditCardIns attribute?
Do not use because of low coverage
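One way the numeric conversion in part a might look is sketched below. K-means relies on distances, so categorical values need numeric stand-ins; the 0/1 coding for sex, the use of range midpoints for salary, and the attribute names `IncomeRange` and `Sex` are assumptions made for the example.

```python
# Hypothetical coding for the two-valued sex attribute.
SEX_CODES = {"Male": 0.0, "Female": 1.0}


def income_midpoint(income_range):
    """Convert a range such as '30-40K' to its midpoint in thousands."""
    low, high = income_range.rstrip("K").split("-")
    return (float(low) + float(high)) / 2


def encode(record):
    """Return a numeric (income, sex) tuple usable as a K-means input point."""
    return (income_midpoint(record["IncomeRange"]), SEX_CODES[record["Sex"]])


point = encode({"IncomeRange": "30-40K", "Sex": "Male"})
```

Because income in thousands spans a much larger range than a 0/1 code, the attributes would usually be rescaled to comparable ranges before clustering so that income does not dominate the distance.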
10. True/false.
[7 pts]
__T__ A data cube for data mining is created by multiple joins of tables from the operational and/or archival database.
__F__ A data cube contains typically a mix of historical data and transactions in progress.
__T__ A data cube will likely contain repetitive information.
__T__ A pivot table is Excel's form of a data cube.
__T__ Drill down and roll up are complementary operations on a data cube.
__T__ Linear regression requires all attributes to be numerical.
__F__ Linear regression modeling maximizes the distance of each instance to the calculated linear regression line.
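The last statement is false because least-squares regression minimizes, rather than maximizes, the squared vertical distances to the line. A minimal single-attribute sketch follows; the data points and function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Total squared vertical distance from each instance to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

slope, intercept = fit_line(xs, ys)
best_error = sum_squared_error(xs, ys, slope, intercept)
# Any other line, such as one with a slightly different slope, has a larger error.
worse_error = sum_squared_error(xs, ys, slope + 0.1, intercept)
```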