Name _____________________
Grade: ____+____+____+____+____+____+____=____
BIS 541 Data Mining and Knowledge Acquisition
Midterm/Final Questions
Solve each question on a different side of a page.
Show all your work.
No exchange of calculators.
1. Given the data set:
Income   Age     Class
low      young   Y
high     young   Y
high     young   Y
low      mid     Y
low      old     N
high     old     N
high     young   N
high     mid     N
low      young   N
where Y stands for yes and N for no.
a) Given the new input X: income = low, age = young, how do you classify the new
data point using Bayesian classification?
b) Describe a feedforward neural network structure for this classification problem,
indicating how the input and output variables are encoded, the number of hidden
nodes, and the transfer functions used in the hidden nodes.
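For part a), the naive Bayes calculation can be sketched as below. This is a minimal illustration, not a model answer: it assumes the nine (income, age) pairs line up with the nine class labels in the order listed, uses raw frequency estimates without Laplace smoothing, and the function name `naive_bayes_scores` is made up for the example.

```python
from fractions import Fraction

# Training rows (income, age, class), read off the data set in question 1.
# Assumption: the i-th (income, age) pair belongs to the i-th class label.
rows = [
    ("low", "young", "Y"), ("high", "young", "Y"), ("high", "young", "Y"),
    ("low", "mid", "Y"), ("low", "old", "N"), ("high", "old", "N"),
    ("high", "young", "N"), ("high", "mid", "N"), ("low", "young", "N"),
]

def naive_bayes_scores(rows, income, age):
    """Return P(class) * P(income | class) * P(age | class) for each class."""
    scores = {}
    for label in sorted({r[2] for r in rows}):
        in_class = [r for r in rows if r[2] == label]
        prior = Fraction(len(in_class), len(rows))
        p_income = Fraction(sum(r[0] == income for r in in_class), len(in_class))
        p_age = Fraction(sum(r[1] == age for r in in_class), len(in_class))
        scores[label] = prior * p_income * p_age
    return scores

scores = naive_bayes_scores(rows, income="low", age="young")
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Exact fractions are used so the hand calculation can be compared term by term; the class with the larger unnormalized score is the predicted class.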
2. A retail company aims to perform a segmentation study of its customers.
a) Suppose the tool offers a k-means algorithm for segmentation. What are the
advantages and disadvantages of having this algorithm at hand? What data
preprocessing actions are required?
b) Suppose the number of data points is too large. If hierarchical clustering is
required by the management, describe how to combine the k-means and hierarchical
algorithms so as to eliminate the disadvantages of both methods.
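One common way to combine the two methods in part b) is to compress the data with k-means into a modest number of micro-clusters and then run agglomerative clustering on the centroids only. The sketch below is a hedged illustration in pure Python: the toy points, the evenly spaced initialisation, and the choice of single linkage are all assumptions made for the example, not requirements of the question.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns the final centroids."""
    step = len(points) // k
    centroids = points[::step][:k]      # simplistic, deterministic initialisation
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

def agglomerate(centroids, n_clusters):
    """Single-linkage agglomerative clustering over the k-means centroids."""
    clusters = [[c] for c in centroids]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)        # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

# Toy data: three well-separated blobs of 25 points each.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx in offsets for dy in offsets]

micro = kmeans(points, k=6)                 # step 1: compress to 6 micro-clusters
segments = agglomerate(micro, n_clusters=3) # step 2: hierarchy over centroids only
print(segments)
```

The hierarchical step now costs time in the number of centroids rather than the number of data points, while the merge order still gives management a hierarchy to inspect.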
3. Construct a data set that generates the tree shown below by the ID3 algorithm. Here N
is the number of data points.
Node 2: A=a1, Decision: Yes, N: 4
Node 3: A=a2, N: 4
    Node 4: B=b1, Decision: No, N: 2
    Node 5: B=b2, Decision: Yes, N: 2
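A candidate data set can be checked by computing information gain, as in the sketch below. The eight rows are hypothetical and only one of many data sets that fit the node counts; the B values in the A=a1 rows are chosen unevenly on purpose, because an even split would tie the gains of A and B at the root.

```python
from collections import Counter
from math import log2

# Hypothetical rows (A, B, class) built to match the node counts in the tree.
data = [
    ("a1", "b1", "Y"), ("a1", "b1", "Y"), ("a1", "b1", "Y"), ("a1", "b2", "Y"),
    ("a2", "b1", "N"), ("a2", "b1", "N"),
    ("a2", "b2", "Y"), ("a2", "b2", "Y"),
]

def entropy(rows):
    """Shannon entropy of the class label (last field) in bits."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, attr):
    """Information gain of splitting rows on the attribute at index attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

gain_a, gain_b = info_gain(data, 0), info_gain(data, 1)
a2_rows = [r for r in data if r[0] == "a2"]
print(round(gain_a, 3), round(gain_b, 3), info_gain(a2_rows, 1))
```

If the gain of A exceeds the gain of B at the root, ID3 splits on A first; the A=a1 branch is pure, and splitting the A=a2 rows on B then yields two pure leaves.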


2. (35 points) Consider a classification problem in which customers taking consumer
credits from a bank are classified into three risk groups. The input variables are age
(discretized into 4 groups), income (4 groups), education (4 groups), gender, number of
months the customer has been dealing with the bank, average delay of payments in months,
and current value of the account balance. The output variable has 3 categories (risky,
normal, or highly risky), calculated by some procedure and provided to the data miner.
Design an encoding schema for the input and output variables so that the problem can be
solved by a neural network. Show a typical topology of a feedforward network architecture.
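One possible encoding is sketched below: one-hot vectors for the categorical inputs and the output, and min-max scaling for the numeric inputs. The group labels, the gender coding, and the numeric ranges in `RANGES` are placeholders invented for the example, since the question does not specify them.

```python
# Hypothetical category labels and numeric ranges; the question does not fix them.
AGE_GROUPS = ["g1", "g2", "g3", "g4"]
INCOME_GROUPS = ["g1", "g2", "g3", "g4"]
EDUCATION_GROUPS = ["g1", "g2", "g3", "g4"]
RISK_CLASSES = ["normal", "risky", "highly risky"]
RANGES = {"months": (0, 240), "delay": (0, 12), "balance": (-10000, 100000)}

def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

def min_max(value, low, high):
    """Scale to [0, 1], clipping values outside the assumed range."""
    return min(max((value - low) / (high - low), 0.0), 1.0)

def encode_input(c):
    """Map one customer record (a dict) to a 16-element input vector."""
    return (one_hot(c["age"], AGE_GROUPS)
            + one_hot(c["income"], INCOME_GROUPS)
            + one_hot(c["education"], EDUCATION_GROUPS)
            + [1.0 if c["gender"] == "F" else 0.0]
            + [min_max(c[k], *RANGES[k]) for k in ("months", "delay", "balance")])

def encode_output(risk):
    """One output node per risk class."""
    return one_hot(risk, RISK_CLASSES)

customer = {"age": "g2", "income": "g4", "education": "g1", "gender": "F",
            "months": 24, "delay": 1.5, "balance": 45000}
x = encode_input(customer)
print(len(x), x, encode_output("risky"))
```

Under these choices the network has 16 input nodes and 3 output nodes, with one hidden layer in between whose size is a tunable parameter. Because age, income, and education are ordered groups, a thermometer code would be a reasonable alternative to one-hot.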
3. (35 points) Consider a shipment company responsible for shipping items from one location
to another on predetermined due dates. Design a star schema OLAP cube for this problem to
be used by managers for decision-making purposes. The dimensions are time, item to be
shipped, person responsible for shipping the item, and location. For each of these dimensions,
determine three levels in the concept hierarchy. Design the fact table with appropriate
measures and keys (include two measures and at least one calculated member in the fact table).
Show one drill-down and one roll-up operation.
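A star schema of this kind can be prototyped with Python's built-in `sqlite3` module, as sketched below. Every table name, column name, hierarchy level, and data row here is an assumption made for illustration; the on-time rate stands in for the calculated member.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, item TEXT, category TEXT, department TEXT);
CREATE TABLE dim_person   (person_id INTEGER PRIMARY KEY, person TEXT, team TEXT, branch TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, region TEXT, country TEXT);
CREATE TABLE fact_shipment (
    time_id INTEGER REFERENCES dim_time,
    item_id INTEGER REFERENCES dim_item,
    person_id INTEGER REFERENCES dim_person,
    location_id INTEGER REFERENCES dim_location,
    shipping_cost REAL,   -- measure 1
    days_late INTEGER     -- measure 2 (0 means delivered on the due date)
);
INSERT INTO dim_time VALUES (1, '2024-01-05', '2024-01', 2024),
                            (2, '2024-01-20', '2024-01', 2024),
                            (3, '2024-02-03', '2024-02', 2024);
INSERT INTO dim_item VALUES (1, 'crate', 'freight', 'logistics');
INSERT INTO dim_person VALUES (1, 'P. Example', 'team A', 'branch 1');
INSERT INTO dim_location VALUES (1, 'Istanbul', 'Marmara', 'Turkey');
INSERT INTO fact_shipment VALUES (1, 1, 1, 1, 100.0, 0),
                                 (2, 1, 1, 1, 150.0, 2),
                                 (3, 1, 1, 1, 200.0, 0);
""")

def cost_by(level):
    """Total cost and on-time rate (calculated member) at one time level."""
    # level is interpolated into the SQL only after this whitelist check.
    assert level in ("day", "month", "year")
    sql = f"""SELECT t.{level}, SUM(f.shipping_cost), AVG(f.days_late <= 0)
              FROM fact_shipment f JOIN dim_time t USING (time_id)
              GROUP BY t.{level} ORDER BY t.{level}"""
    return con.execute(sql).fetchall()

print(cost_by("year"))    # roll-up: month -> year
print(cost_by("month"))   # drill-down: year -> month
```

Moving the `GROUP BY` column up or down the time hierarchy is all that distinguishes the roll-up from the drill-down; the same pattern applies to the other three dimensions.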
1. A data warehouse is constructed for the library of a university to be used as a
multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user,
books, and time (time_ID, year, quarter, month, week, academic year, semester, day).
“Week” is considered not to be less than “month”. Each academic semester starts
and ends at the beginning and end of a week respectively. Hence, week < semester.
a. Describe concept hierarchies for the three dimensions. Construct meaningful
attributes for each of the dimension tables above. Describe at least two meaningful
measures in the fact table. Each dimension can also be viewed at its ALL level.
b. Describe three meaningful OLAP queries and write SQL expressions for one of
them.
2. An e-commerce company aims to perform a segmentation study of its visitors
so as to force visitors to stay and place orders. There is a concept hierarchy for products
that can be followed by visitors. At the end of the hierarchy there are products that can be
viewed or ordered. Each session may end up with an order.
Suppose the tool offers a k-means algorithm for segmentation. What are the
advantages and disadvantages of having this algorithm at hand? What data
preprocessing actions are required (missing value handling, data transformations)?
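Because k-means relies on Euclidean distance, two typical preprocessing steps are filling in missing values and putting all variables on a comparable scale. The sketch below shows mean imputation followed by z-score standardization; the session features and their values are hypothetical, and mean imputation is only one of several reasonable strategies.

```python
from statistics import mean, pstdev

# Hypothetical per-session features: (pages_viewed, minutes_on_site).
# None marks a missing value.
sessions = [(3, 1.0), (10, None), (5, 4.0), (22, 9.0)]

def impute_and_standardize(rows):
    """Fill missing values with the column mean, then z-score each column."""
    columns = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        filled = [mean(known) if v is None else v for v in col]
        mu, sigma = mean(filled), pstdev(filled)
        # A constant column carries no information; map it to zeros.
        columns.append([(v - mu) / sigma if sigma else 0.0 for v in filled])
    return [tuple(r) for r in zip(*columns)]

scaled = impute_and_standardize(sessions)
print(scaled)
```

After this step every column has mean 0 and standard deviation 1, so no single variable dominates the distance computation merely because of its units.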
3. A churn model is to be developed for the customers of a telecommunication
company, aiming at predicting whether a customer will be a churner in the next month or
not. Customers leave the company voluntarily, and 1% of the customers churn each
month. Relevant data is available in three tables: a customer table holding personal
information about customers, a billing table holding payment characteristics of
customers, and a calling table holding summaries of calling patterns.
a) Propose two variables from each table that you think are important in explaining churn.
b) What are the characteristics of these variables?
c) What are the reasons for missing values, and how do you handle them?
d) What are possible data inconsistencies?
e) Do you make any discretization?
f) Do you make any data transformations? Do you apply any data reduction strategies?
g) Define your target and input variables in classification. Which classification
techniques and algorithms do you use in solving the classification problem? Support your
answer.
h) Which functionality of data mining is appropriate for this problem, and which algorithm
would you suggest? What are the advantages and disadvantages of the algorithm you
propose?
a. Define your variables, indicating their categories, in clustering. Which clustering
techniques and algorithms do you use in solving the clustering problem?
Support your answer.
1. (Han, page 328, problem 7.6) The following table consists of training data from
an employee database. For a given row entry, count represents the number of data tuples
having the values for department, status, age and salary given in that row.
The predicted variable is status; age, salary and department are the inputs.
Department   Status   Age     Salary
Sales        Senior   31-35   46K-50K
Sales        Junior   26-30   26K-30K
Sales        Junior   31-35   31K-35K
Systems      Junior   21-25   46K-50K
Systems      Senior   31-35   66K-70K
Systems      Junior   26-30   46K-50K
Systems      Senior   41-45   66K-70K
Marketing    Senior   36-40   46K-50K
Marketing    Junior   31-35   41K-45K
Secretary    Senior   46-50   36K-40K
Secretary    Junior   26-30   26K-30K
a) Solve this classification problem using the ID3 algorithm, where status is the output
variable.
b) Design a multilayer feedforward neural network for the given data. Label the nodes in
the input and output layers. Describe how you encode the input and output variables, and
specify the parameters of the network that can be changed.
2. For each of the following problems, identify the relevant data mining tasks.
a) A stock market analyst is asked to calculate the likely change in stock price for a set
of companies with similar price/earnings ratios.
b) A political strategist is seeking the best groups to canvass for donations in a
particular country.
c) A defense computer must immediately decide whether a blip on the radar is a flock
of geese or an incoming nuclear missile.
d) A homeland security official would like to determine whether a certain sequence of
financial or residential moves implies a tendency for terrorist acts.
3. A retail company is asked to segment its customers. The following variables are available for
each customer: age, income, gender, number of children, occupation, house owner or not, has a car or
not. There are 6 categories of goods sold by the company, and total purchases from each
category are available for each customer. In addition, average inter-purchase time is also
included in the database.
a) What are the types of these variables?
b) If your tool has only the k-means algorithm, which of these variables are more suitable
for the segmentation problem?
c) What data transformations are applied?
d) How do you reduce the number of variables used in the analysis?
e) How do you determine the number of customer segments?
f) How do you measure similarity between occupation and gender?
g) If you want to include categorical variables in your clustering, how would you treat
1. (20 pts) For each of the following problems, identify the relevant data mining tasks with a
brief explanation.
a) A weather analyst is interested in whether the temperature will be up or down for
the coming day.
b) An insurance analyst intends to group policy holders according to characteristics of
customers and policies.
c) A medical researcher is looking for symptoms that occur together among a
large set of patients.
d) An educational program director would like to determine the likely GPA of applicants to
an MA program from their ALES scores, undergraduate GPAs and entrance exam scores.
2. (20 pts) Develop a data warehouse for a weather bureau that has many probes
located all over a large region, using a star schema. These probes collect basic weather data
such as temperature, air pressure, humidity, … every hour. All the data is sent to a
central station to be processed.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by querying the warehouse.
d) Show one roll-up and one drill-down operation for one of these questions.
3. (20 pts) Evaluate the four classification methods (decision trees, neural networks,
Bayesian classification and k-NN) in terms of
a) accuracy
b) speed of model development and use
c) understandability and interpretability of output
d) handling of outliers if not handled in the preprocessing step
4. (20 pts) Questions about constraint-based association rule mining.
The price of each item is nonnegative. For the following cases, indicate the type of
constraint (monotonic, anti-monotonic or none).
a) The sum of the prices of the items is less than or equal to 10.
b) The average price of the items is less than or equal to 20.
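The two definitions can be probed by brute force on a small item set, as sketched below. The prices are hypothetical and chosen only to expose counterexamples; an exhaustive check on one sample can refute a property but cannot prove it in general, so a passing check is evidence, not a proof.

```python
from itertools import combinations

# Hypothetical nonnegative prices, chosen only to illustrate the definitions.
prices = {"a": 5, "b": 30, "c": 40, "d": 4}

def itemsets(items):
    """All nonempty subsets of items as frozensets."""
    return [frozenset(c) for n in range(1, len(items) + 1)
            for c in combinations(items, n)]

def sum_ok(s):
    return sum(prices[i] for i in s) <= 10

def avg_ok(s):
    return sum(prices[i] for i in s) / len(s) <= 20

def is_anti_monotone(pred, sets):
    """If an itemset satisfies pred, every nonempty proper subset must too."""
    return all(pred(t) for s in sets if pred(s) for t in sets if t < s)

def is_monotone(pred, sets):
    """If an itemset satisfies pred, every proper superset must too."""
    return all(pred(t) for s in sets if pred(s) for t in sets if s < t)

sets = itemsets(sorted(prices))
print("sum <= 10:", is_anti_monotone(sum_ok, sets), is_monotone(sum_ok, sets))
print("avg <= 20:", is_anti_monotone(avg_ok, sets), is_monotone(avg_ok, sets))
```

On this sample the sum constraint survives the anti-monotone check, which matches the general argument that removing an item with a nonnegative price cannot increase the sum. The average constraint fails both checks: {a, b} satisfies it while its subset {b} does not, and {a} satisfies it while its superset {a, c} does not.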
5. (20 pts) Based on a sample of 30 observations, the population regression model
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
is estimated.
The least squares estimate of the intercept is 10.0.
The sums of the values of the dependent and independent variables are 450 and 150 respectively.
The estimated variance of the dependent variable is 25, and the variance of the residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not have a significant
effect on Y at a confidence level of 95%. Critical value: $F_{0.05}(1, 28) = 4.20$.
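The arithmetic behind parts a) to d) can be laid out as below. This is a sketch under two stated assumptions that the question leaves implicit: the estimated variance of Y is taken as SST / (n - 1), and the residual variance as SSE / (n - 2). With other conventions the sums of squares would differ.

```python
n = 30
b0 = 10.0                        # least squares intercept (given)
sum_y, sum_x = 450.0, 150.0      # sums of Y and X (given)
var_y, var_resid = 25.0, 4.0     # estimated variances (given)

y_bar, x_bar = sum_y / n, sum_x / n
# The fitted line passes through (x_bar, y_bar): y_bar = b0 + b1 * x_bar.
b1 = (y_bar - b0) / x_bar

sst = var_y * (n - 1)            # assumption: var_y = SST / (n - 1)
sse = var_resid * (n - 2)        # assumption: var_resid = SSE / (n - 2)
ssr = sst - sse                  # SST = SSR + SSE
r2 = ssr / sst                   # coefficient of determination

f_stat = (ssr / 1) / (sse / (n - 2))
reject_h0 = f_stat > 4.20        # critical value F_0.05(1, 28) from the question

print(b1, ssr, sse, round(r2, 4), f_stat, reject_h0)
```

The slope is interpreted as the estimated change in Y per unit change in X, and the null hypothesis of no effect is rejected when the F statistic exceeds the critical value.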