Data Mining and
Knowledge Acquisition
— Chapter 7 —
—Data Mining Overview
and Exam Questions—
2014/2015 Summer
1
Outline










Methodology - Overview
Introduction
Data Description – Preprocessing
OLAP
Clustering
Classification
Numerical Prediction - Regression
Frequent Pattern Mining
Recent BIS Exams
Unclassified Questions
2
Methodology and Overview


KDD Methodology
Functionalities
3
KDD Methodology

Methodology
 Problem definition
 Data set selection
 Preprocessing transformations
 Functionalities



Classification/numerical prediction
Clustering
Frequent Pattern Mining



Association
Sequential analysis
others
4
KDD Methodology (cont.)

Algorithms
 For classification you can use
   Decision trees: ID3, C4.5, CHAID are algorithms
 For clustering you can use
   Partitioning methods: k-means, k-medoids
   Hierarchical: AGNES
   Probabilistic: EM is an algorithm
Presenting results
 Back transformations
 Reports
Taking action
5
Data Description

Single variables
 Categorical - ordinal, nominal
   Frequency plots, tables, pie charts
 Continuous – interval, ratio
   Five-number summary, central tendency, spread
   Examine the probability distribution
For two variables
 Both categorical
   Cross tabulation
 One categorical, the other continuous
 Both continuous
   Correlation coefficient, scatter plots
6
Preprocessing







Missing values
Inconsistencies
Redundant data
Outliers
Data transformations
Data reduction
 Attribute elimination
 Attribute combination
 Sampling
Histograms
7
Functionalities






Styles of Data Mining
Descriptive - OLAP
Classification
Numerical Prediction
Clustering
Frequent Pattern Mining
8
Two basic styles of data mining

Descriptive
 Cross tabulations, OLAP, attribute-oriented
induction, clustering, association
Predictive
 Classification, numerical prediction

Difference between classification and numerical
prediction
Questions answered by these styles
Supervised vs. unsupervised
9
Descriptive - OLAP





Concept of data cube
Fact table
 Measures – calculated measures
 Keys
Dimensions
Schemas
 Star, snowflake
Concept hierarchies
 Set grouping, such as price or age
 Parent-child
 Attributes not suitable for concept hierarchies
10
Classification




Methods
 Decision trees
 Neural networks
 Bayesian
 K-NN or model based reasoning
Advantages, disadvantages
Given a problem, which data processing techniques are
required?
Given a problem, which classification method or
algorithm is more appropriate?
11
Classification (cont.)

Accuracy of the model
 Measures for classification/numerical
prediction
 How to better estimate
   Holdout, cross-validation, bootstrapping
 How to improve
   Bagging, boosting
   For unbalanced classes
What to do with models
 Lift charts
12
Numerical Prediction

Learning is supervised
Output variable is continuous
Methods
 Regression
   Simple
   Multiple
 Most methods for classification can be used
for numerical prediction as well
Accuracy
 Root mean square error, mean absolute deviation
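
The two accuracy measures named on this slide can be sketched in a few lines. The actual and predicted values below are hypothetical and only illustrate the arithmetic; the function names are chosen for this example.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_absolute_deviation(actual, predicted):
    """Mean of the absolute residuals."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical values, not taken from any exam question.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

print(rmse(actual, predicted))
print(mean_absolute_deviation(actual, predicted))
```

Because the residuals are squared, RMSE penalizes the single large error (2.0) more heavily than the mean absolute deviation does.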
13
Clustering

Distance measures
 Dissimilarity or similarity
 For different types of variables
   Ordinal, binary, nominal, ratio, interval
Why data needs to be transformed
Partitioning methods
 K-means, k-medoids
   Advantages, disadvantages
Hierarchical
Density based
Probabilistic
14
Frequent Pattern Mining






Association analysis
Apriori or FP-Growth
How to measure the strength of rules
 Support and confidence
 Other measures of interestingness; critique of
support and confidence
Multiple levels
Constraints
Sequential pattern mining
15
Introduction


Defining problems
 Given a short description of an environment,
define data mining problems fitting different
functionalities, and possible preprocessing
problems peculiar to the environment
Basic functionalities
 Given a short description of a data mining
problem, with which functionality is the
problem solved?
17
Big University Library

1. Suppose that a data warehouse for Big-University
Library consists of the following three dimensions:
users, books, time, and each dimension has four
levels not including the all level. There are three
measures: You are asked to perform a data mining
study on that warehouse. (25 points)

Define three data mining problems on that
warehouse: involving association, classification
and clustering functionalities respectively. Clearly
state the importance of each problem. What is the
advantage of the data being organized as OLAP
cubes compared to relational table organisation?
18
Big University Library (cont.)

In the data preprocessing stage of KDD:
What are the reasons for missing values, and
how do you handle them?
What are the possible data inconsistencies?
Do you apply any discretization?
Do you apply any data transformations?
Do you apply any data reduction strategies?
19
Big University Library (cont.)



Define your target and input variables in
classification. Which classification techniques and
algorithms do you use in solving the classification
problem? Support your answer.
Define your variables, indicating their categories, in
clustering. Which clustering techniques and
algorithms do you use in solving the clustering
problem? Support your answer.
Describe the association task in detail, specifying the
algorithm, interestingness measures, and constraints, if
any.
20
Data mining on MIS

A data warehouse for the MIS department
consists of the following four dimensions:
student, course, instructor, semester and
each dimension has five levels including the
all level. There are two measures: count and
average grade. At the lowest level, average
grade is the actual grade of a student. You
are asked to perform a data mining study on
that warehouse. (25 points)
21
Data mining on MIS (cont.)


Define three data mining problems on that
warehouse: involving association, classification and
clustering functionalities respectively. Clearly state
the importance of each problem. What is the
advantage of the data being organized as OLAP
cubes compared to relational table organisation?
In the data preprocessing stage of KDD:
What are the reasons for missing values, and how do you
handle them?
What are the possible data inconsistencies?
Do you apply any discretization?
Do you apply any data transformations?
Do you apply any data reduction strategies?
22
Data mining on MIS (cont.)



Define your target and input variables in
classification. Which classification techniques and
algorithms do you use in solving the classification
problem? Support your answer.
Define your variables, indicating their categories, in
clustering. Which clustering techniques and
algorithms do you use in solving the clustering
problem? Support your answer.
Describe the association task in detail, specifying the
algorithm, interestingness measures, and constraints, if
any.
23
Data Description


How to describe single variables – categorical
and continuous
How to describe the association between two
variables
 both continuous
 both categorical
 one continuous, one categorical
25
Preprocessing



What to do as preprocessing?
Which techniques are applied?
For what reason?
26
MIS 542 Midterm 2011/2012 Fall PCA



5. (10 points) Consider two continuous
variables X and Y. Generate data sets
a) where PCA (principal component analysis)
cannot reduce the dimensionality from two to
one
b) where, although the two variables are related
(a functional relationship exists between these
two variables), PCA is not able to reduce the
dimensionality from two to one
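
One possible way to check candidate data sets for this question is to look at the share of total variance captured by the first principal component. The sketch below assumes NumPy is available; the two data sets (points on a circle, and a parabola over a symmetric range) are illustrative choices rather than the expected exam answer, and `first_component_share` is a helper name invented here.

```python
import numpy as np

def first_component_share(x, y):
    """Share of total variance explained by the first principal component."""
    covariance = np.cov(np.vstack([x, y]))        # 2x2 sample covariance matrix
    eigenvalues = np.linalg.eigvalsh(covariance)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues.sum()

# a) Points evenly spaced on a circle: equal variances, zero covariance.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle_x, circle_y = np.cos(theta), np.sin(theta)

# b) Y is a function of X (Y = X^2), but over a symmetric range the linear
#    correlation is zero, so a single linear component still loses variance.
x = np.linspace(-1.0, 1.0, 201)
parabola_y = x ** 2

print(first_component_share(circle_x, circle_y))
print(first_component_share(x, parabola_y))
```

A share close to 1 would mean one component is enough; in both sketches a substantial part of the variance stays in the second component because PCA only captures linear structure.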
27
MIS 542 Final 2011/2012 Fall outliers



1 (20 points) Give two examples of outliers.
a) Where outliers are useful and essential
patterns to be mined.
b) Where outliers are useless, stemming from error or
noise.
28
MIS 542 Final 2011/2012 Fall
transformations

2 (20 points) Considering the classification
methods we cover in class, describe two
distinct reasons why continuous input variables
have to be normalized for classification
problems (each reason 10 points).
29
OLAP





Concept of data cube
Fact table
 Measures – calculated measures
 Keys
Dimensions
Schemas
 Star, snowflake
Concept hierarchies
 Set grouping, such as price or age
 Parent-child
 Attributes not suitable for concept hierarchies
31
Data warehouse for library

A data warehouse is constructed for the library of a university to
be used as a multi-purpose DSS. Suppose this warehouse
consists of the following dimensions: user, books, and time
(time_ID, year, quarter, month, week, academic year, semester,
day). “Week” is considered not to be less than “month”.
Each academic semester starts and ends at the beginning and
end of a week respectively. Hence, week < semester.

Describe concept hierarchies for the three dimensions.
Construct meaningful attributes for each of the dimension tables
above. Describe at least two meaningful measures in the
fact table. Each dimension can also be viewed at its ALL
level.

What is the total number of cuboids for the library cube?

Describe three meaningful OLAP queries and write an SQL
expression for one of them.
32
Big University

2. (Han, page 100, 2.4) Suppose that the data
warehouse for Big-University consists of the
following dimensions:
student, course, instructor, semester, and two measures:
count and average_grade. At the lowest
conceptual level (for a given student, instructor, course,
and semester) the average_grade measure stores the
actual grade of the student. At higher conceptual levels
average_grade stores the average grade for the
given combination. (For example, when student is MIS, semester
is 2005 all terms, course is MIS 541, and instructor is Ahmet Ak,
average_grade is the average of the students' grades in that
course by that instructor in all semesters of 2005.)
33
Big University (cont.)



a) Draw a snowflake schema diagram for that
warehouse.
What are the concept hierarchies for the
dimensions?
b) What is the total number of cuboids?
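
For cuboid-counting questions like part b), a commonly used formula (for dimensions whose concept hierarchies are simple linear chains) multiplies, over all dimensions, the number of levels plus one for the ALL level. The sketch below applies it; the level counts are hypothetical, since the slide does not state them, and `total_cuboids` is a name chosen for this example.

```python
from math import prod

def total_cuboids(levels_excluding_all):
    """Total cuboids = product over dimensions of (levels + 1 for ALL).

    Assumes each dimension has a single linear concept hierarchy.
    """
    return prod(levels + 1 for levels in levels_excluding_all)

# Hypothetical: four dimensions, each with four levels below ALL.
print(total_cuboids([4, 4, 4, 4]))

# Hypothetical: three dimensions, each with four levels below ALL.
print(total_cuboids([4, 4, 4]))
```

The formula does not directly apply when a dimension has a partially ordered hierarchy (for example, a time dimension where week does not roll up into month); those lattices have to be counted level combination by level combination.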
34
MIS 542 Final 2005/2006 Spring olap




1. The MIS department wants to revise its academic
strategies for the following ten years. Relevant
questions are: What portion of the courses are
required or elective? What is the full-time/part-time
distribution of instructors? What is the
course load of instructors? What percent of
technical or managerial courses are taught by
part-time instructors? How have all these things
35
MIS 542 Final S06 1 cont.





changed over the years? You can add similar
strategic questions of your own. Do not consider
the students' aspects of the problem for the time
being. Design an OLAP schema to be used as a
strategic tool. You are free to decide the
dimensions and the fact table. Describe the
concept
hierarchies, virtual dimensions and calculated
members. Finally, show OLAP operations to
answer three such strategic questions.
36
MIS 54 Final 2012/2013 Hospital






2. (20 pts) Suppose that a data warehouse for a
hospital consists of the following dimensions: time,
doctor and patient and the two measures count and
charge, where charge is the fee a doctor charges a
patient for a visit.
Design a warehouse with star schema:
a) Fact table: Design the fact table.
b) Dimension tables: For each dimension show a
reasonable concept hierarchy.
c) State two questions that can be answered by that
OLAP cube.
d) Show drilldown and roll up operations related to one
of these questions
37
Human Resource cube



1. (25 points) In an organization a data warehouse is to be
designed for evaluating the performance of employees. To evaluate
the performance of an employee, a survey questionnaire consisting of a
set of questions on a 5-point Likert scale is answered by other
employees in the same company at specified times. That is, the
performance of employees is rated by other employees.
Each employee has a set of characteristics including department,
education,… Each survey is conducted at a particular date and applied
to some of the employees. Questions are aimed at evaluating broad
categories of performance such as motivation, cooperation
ability,…
Typically, a question in a survey, aiming to measure a specific
attitude of an employee, is evaluated by another employee
(rated from 1 to 5). Data is available at the question level.
38
Human resource cube (cont.)





Cube design: a star schema
Fact table: Design the fact table; it should contain one
calculated member. What are the measures and keys?
Dimension tables: Employee and Time are the two
essential dimensions; include Survey and Question
dimensions as well. For each dimension show a concept
hierarchy.
State three questions that can be answered by that
OLAP cube.
Show drill-down and roll-up operations related to these
questions.
39
MIS Midterm 2008/2009 Spring Shipment



1. (20 points) Consider a shipment company
responsible for shipping items from one location to
another on predetermined due dates. Design a star
schema OLAP cube for this problem to be used by
managers for decision making purposes. The
dimensions are time, item to be shipped, person
responsible for shipping the item, and location. For each of
these dimensions determine three levels in the concept
hierarchy. Design the fact table with appropriate
measures and keys (include two measures and at least
one calculated member in the fact table).
Show one drill-down and one roll-up operation.
Show the SQL query of one of the cuboids.
40
Outline

Clustering
41
Comparing clustering methods



Clustering methods
Partitioning, hierarchical, density based, model based: probabilistic EM
Compare clustering methods
 Output
 Interpretation
 Sensitivity to outliers
 Speed of computation
43
clustering

Construct simple data sets showing the
inadequacies of k-means clustering (20 points)

this algorithm is not suitable even for
spherical clusters of different sizes

What are the advantages and disadvantages
of using k-means?
44
clustering
1.
Consider a delivery center location decision
problem in a city where a set of related
products are to be delivered to markets
located in the city. Design an algorithm for this
location selection problem by extending an
algorithm we cover in class. State clearly the
algorithm and its extensions for this particular
problem.
45
Clustering preferences

Consider a popular song competition. There are N competitors
A1, A2,… AN. The number of voters is very large: a substantial
fraction of the population of the country. Each voter is able to
rank the competitors from best to worst, e.g. for voter 1
(A4>A2>A3>A1), meaning that there are four competitors and
A4 is the best for voter 1, A1 being the worst. Suppose
preference data is available for a sample of n voters at the
beginning of the competition.

Develop a distance measure between the preferences of two
voters i and j.

Suppose you have the k-means algorithm available in a
package. Describe how you can use the k-means
algorithm to cluster voters according to their preferences.
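
One possible approach (a sketch, not the expected exam answer) is to encode each voter as a vector of rank positions, one entry per competitor. Two candidate distances are then easy to compute: the number of competitor pairs the two voters order differently (often called the Kendall tau distance), and the squared Euclidean distance between rank vectors, which is the form a standard k-means package can work with directly. The function names and the second voter's ranking are assumptions made for illustration.

```python
from itertools import combinations

def rank_vector(order, competitors):
    """Position (1 = best) of each competitor in a voter's ordered preference."""
    position = {name: i + 1 for i, name in enumerate(order)}
    return [position[name] for name in competitors]

def kendall_distance(r1, r2):
    """Number of competitor pairs that the two voters rank in opposite order."""
    return sum(
        1
        for i, j in combinations(range(len(r1)), 2)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )

def squared_euclidean(r1, r2):
    """Squared Euclidean distance between rank vectors (k-means friendly)."""
    return sum((a - b) ** 2 for a, b in zip(r1, r2))

competitors = ["A1", "A2", "A3", "A4"]
voter1 = rank_vector(["A4", "A2", "A3", "A1"], competitors)  # example from the slide
voter2 = rank_vector(["A1", "A2", "A3", "A4"], competitors)  # hypothetical second voter

print(voter1, voter2)
print(kendall_distance(voter1, voter2))
print(squared_euclidean(voter1, voter2))
```

Feeding the rank vectors to k-means treats ranks as interval-scaled numbers, which is itself a modeling assumption; the resulting cluster centers are average rank profiles rather than valid rankings.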
46
MIS 542 Final 2005/2006 Spring


3. a) Describe how to modify the k-means
algorithm so as to handle categorical variables
(binary, ordinal, nominal).
b) What is a disadvantage of the agglomerative
hierarchical clustering method in the case of
large data? Suggest a way of eliminating this
disadvantage while keeping the advantages
of agglomerative methods.
47
MIS 542 Midterm 2007/2008 Spring



Generate a data set of two continuous variables X
and Y. Consider clustering based on density.
When clustered with one variable (either
X or Y), there is one cluster.
When clustered with both variables, there
are two clusters.
48
MIS 542 Final 2011/2012 Fall


3. a) (10 points) Generate data sets for two clustering
problems with two continuous variables: there are two natural
clusters under the notion of density-based clustering, but
the quality of these clusters is low for a partitioning
approach based on dissimilarity, such as k-means.
3. b) (10 points) Consider the advantages and
disadvantages of partitioning and hierarchical
agglomerative clustering approaches. Design a method
for combining the two approaches to improve
clustering quality. (In the end there are hierarchies of
clusters.)
49
MIS Midterm 2011/2012 Fall


6. (25 points) A retail company has asked for
its customers to be segmented. The following variables are
available for each customer: age, income,
gender, number of children, occupation, whether they
own a house, and whether they have a car. There are 6 categories
of goods sold by the company, and total
purchases from each category are available for
each customer; in addition, the average
inter-purchase time is also included in the
database.
50
MIS Midterm 2011/2012 Fall





a) What are the types and scales of these variables?
b) If your tool has only k-means algorithm which of
these variables are more suitable for the segmentation
problem?
c) What data transformations are to be applied?
d) How do you reduce number of variables used in the
analysis?
e) If you want to include categorical variables into your
clustering, how would you treat them?
51
Midterm 2011/2012 Fall





In Questions 3-5, artificial data sets are generated for
given situations.
3. (10 points) Consider a data set of two continuous
variables X and Y. There are two clusters (k=2).
Considering the advantages and disadvantages of the
partitioning methods k-means and k-medoids,
generate a two-dimensional data set that
a) (5 points) produces almost the same clusters with
k-medoids and k-means
b) (5 points) produces different clusters with k-medoids and
k-means
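
The difference between the two methods comes from how a cluster's representative is computed: a mean can be pulled arbitrarily far by an outlier, whereas a medoid must be one of the actual points. The sketch below shows this on a single one-dimensional cluster, a deliberate simplification of the two-dimensional data sets the question asks for; all values are hypothetical.

```python
def mean_center(points):
    """Representative used by k-means: the arithmetic mean."""
    return sum(points) / len(points)

def medoid(points):
    """Representative used by k-medoids: the member point with the
    smallest total distance to all other points."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

compact = [1.0, 2.0, 3.0, 4.0, 5.0]         # no outlier
with_outlier = [1.0, 2.0, 3.0, 4.0, 100.0]  # one extreme value

print(mean_center(compact), medoid(compact))
print(mean_center(with_outlier), medoid(with_outlier))
```

Without the outlier the two representatives coincide, which is the situation in part a); with the outlier the mean moves far away from the bulk of the points while the medoid stays put, which is what can lead to different cluster assignments in part b).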
52
Outline

Classification
 General
 Decision trees
 Neural networks
 Bayesian
 K-NN
 Accuracy measures
54
Information gain
1.
Consider a data set of two attributes A and B.
A is continuous, whereas B is categorical,
having two values, “y” and “n”, which can
be considered as the class of each observation.
When attribute A is discretized into two
equal-width intervals, it provides no information
about the class attribute B, but when it is discretized
into three equal-width intervals, it provides perfect
information about B. Construct a simple
dataset obeying these characteristics.
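
A candidate data set for this question can be checked numerically. The sketch below uses six hypothetical observations (the values of A and the class labels are illustrative, not an official answer) and computes the information gain of A about B after equal-width discretization into two and into three intervals. Equal-width bins are taken over the observed range of A, which is an assumption of this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def equal_width_bins(values, k):
    """Bin index of each value when [min, max] is split into k equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def information_gain(values, labels, k):
    """Entropy of the labels minus the weighted entropy within each bin."""
    bins = equal_width_bins(values, k)
    n = len(labels)
    remainder = 0.0
    for b in set(bins):
        subset = [lab for bb, lab in zip(bins, labels) if bb == b]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical data: the class is "n" only in the middle third of A's range.
A = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
B = ["y", "y", "n", "n", "y", "y"]

print(information_gain(A, B, 2))  # two halves with identical class mix
print(information_gain(A, B, 3))  # three pure intervals
```

With two intervals each half contains two "y" and one "n", the same mix as the whole data set, so nothing is gained; with three intervals every interval is pure, so the gain equals the full entropy of B.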
55
Decision tree

2. a) Construct a data set that generates the
tree shown below. In addition, the following
conditions are satisfied:
Node 2: A=a1, decision Y
Node 3: A=a2
   Node 4: B=b1, decision N
   Node 5: B=b2, decision Y

56
MIS 541 2012/2013 Final

1. (20 pts) Consider a decision tree with only
two branches at each node, in which the attribute selection
measure is entropy. Bearing in mind that each
candidate input attribute may have more than
two distinct values, how do you modify the ID3
algorithm to handle such a constraint on the
number of branches of the tree?
57
MIS 542 Final 2005/2006 Spring











2. Given the training data set with missing values:

A (Size)   B (color)   C (shape)   Class
small      yellow      round       A
big        yellow      round       A
big        yellow      red         A
small      red         round       A
small      black       round       B
big        black       cube        B
big        yellow      cube        B
big        black       round       B
small      yellow      cube        B
58
MIS 542 Final 2005/2006 Spring (cont.)




a) Apply the C4.5 algorithm to construct a decision
tree.
b) Given the new inputs X: size=small, color=missing,
shape=round and Y: size=big, color=yellow,
shape=missing, what is the prediction of the tree for X
and Y?
c) How do you classify the new data points given in
part b) using Bayesian classification?
d) Analyse the possibility of pruning the tree. You can
make a normal approximation to the binomial distribution
even though the number of observations is low. The z value for the upper
confidence limit of c=25% is 0.69.
59
MIS 542 Final S06 neural networks

4. Consider a classification problem with two
classes as C1 and C2. There are two numerical
input variables X1 and X2, taking values
between 0 and infinity. All observations are of
class C1 if they are above the X2 = 1/X1 curve (a
hyperbola). All other observations are class C2.
Describe how multilayer perceptrons can
separate such a boundary using as few hidden
nodes as possible.
60
MIS 542 Midterm S08 2 classification

Consider a classification problem with two continuous
variables X and Y and a categorical output with two
distinct values C1 and C2.
Generate data sets such that
A) Decision trees are appropriate for classification
B) Decision trees are not appropriate for classification
but a perceptron can classify the data successfully
C) Even a single perceptron is not enough to classify
the data
D) How do you incorporate a perceptron into decision
trees so that the cases in B and C can be classified by a
hybrid approach of DTs and perceptrons?
61
Final 2010/2011 Spring




2 (30 pt.) Consider a prediction problem; e.g. predicting
weight using height (a continuous variable) as input,
solved by neural networks. Such methods as back
propagation try to minimize the prediction error but it is
claimed that the magnitude of error depends on the
weight: a prediction error of 0.5 for a baby with a short
height should not be the same as for an adult with a
height of 2.00 meters.
a) Make a scatter plot of such a hypothetical data set
for a two variable problem.
b) Plot the prediction error on another graph
c) Do you need to modify the back propagation
algorithm so as to handle such a situation? If so explain
your modification.
62
Final 2011/2012 Fall overfitting



4. Illustrate the overfitting of neural networks
for the following cases by generating data sets.
a) (10 points) For a binary classification
problem with two continuous inputs.
b) (10 points) For a numerical prediction
problem (output being continuous) with one
continuous input variable.
63
Midterm 2011/2012 Fall



4. (10 points) Consider a classification problem
solved by a decision tree. Consider a categorical
input variable A having two distinct values. The
output variable B has two distinct classes as
well. At a particular node of the tree there are
N data objects. Generate a partitioning of the data by
input variable A for the following cases:
a) A does not provide any information: the
information gain is zero.
b) A provides perfect information: the
information gain is as large as possible.
64
MIS 541 2012/2013 Final

5. (20 pts) Consider a classification problem
solved by k-NN. Suppose in your dataset all
inputs are continuous variables. Why do you
need to apply data transformations? What
data transformation is applied? Suppose the
variables are to be weighted after the
transformations. Devise a method for
determining optimal weights for the variables as well
as determining the optimal k value, considering that
k-NN is a supervised learning method.
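
The need for a transformation can be seen on a tiny example: with raw values, the variable with the larger numeric range dominates the Euclidean distance. The sketch below uses min-max scaling as one common choice (z-score standardization is another); the three customers and their (age, income) values are hypothetical.

```python
import math

def min_max_scale(rows):
    """Rescale every column to [0, 1]; assumes no column is constant."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest_neighbor(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index]."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(query, rows[i]))

# Hypothetical (age, income) records.
customers = [(25, 30000), (60, 30500), (26, 31000)]

print(nearest_neighbor(customers, 0))                 # income dominates
print(nearest_neighbor(min_max_scale(customers), 0))  # both variables count
```

On the raw data the 25-year-old's nearest neighbor is the 60-year-old with a similar income; after scaling it is the 26-year-old. Since class labels are available, weights and k could then be compared by their classification accuracy on held-out data, for example with cross-validation.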
65
MIS 541 2012/2013 Final



5. (20 pts) The following table consists of
training data from an employee database.
The predicted variable is Status. Age, Salary and
Department are inputs.
Design a multilayer feedforward neural network
for the given data. Label the nodes in the
input, hidden and output layers. Describe how
you encode the input and output variables, and
specify the parameters of the network that can
be changed by the backpropagation algorithm.
66
Department   Status   Age     Salary
Sales        Senior   31-35   46K-50K
Sales        Junior   26-30   26K-30K
Sales        Junior   31-35   31K-35K
Systems      Junior   21-25   46K-50K
Systems      Senior   31-35   66K-70K
Systems      Junior   26-30   46K-50K
Systems      Senior   41-45   66K-70K
Marketing    Senior   36-40   46K-50K
Marketing    Junior   31-35   41K-45K
Secretary    Senior   46-50   36K-40K
Secretary    Junior   26-30   26K-30K
67
Accuracy measures


For class balance or imbalance problems
Output variables with ordinal scale
 How do you modify the accuracy measure for
an ordinal output variable with three
different values?
 Give an example of such a variable
68
BIS 541 2012/2013 Final II









5. Based on a sample of 30 observations, the population regression
model
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
is estimated.
The least squares estimate of the intercept is 10.0.
The sums of the values of the dependent and independent variables are 450
and 150 respectively.
The estimated variance of the dependent variable is 25; the variance of the
residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret
the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not
have a significant effect on Y at a confidence level of 95%. The critical
value is $F_{0.05}(1, 28) = 4.20$.
70
BIS 541 2013/2014 Final





4. Based on a sample of 50 observations, the population
regression model to predict the number of automobile sales
(dependent variable) based on advertisement
placements (independent variable) is estimated:
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
The least squares estimate of the slope is 2.0.
The average of the values of the independent variable is 50.
The sum of the values of the dependent variable is 5390.
The total sum of squares for the dependent variable is 9000.
The variance of the residuals is 40.
71
BIS 541 2013/2014 Final




a) What is the least squares estimate of the
intercept coefficient? Interpret the figure.
b) Interpret the slope coefficient.
c) What are the values of SSR and SSE?
d) Find and interpret the coefficient of
determination.
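
A sketch of the computation for this question, using the figures from the preceding slide (n = 50, slope 2.0, mean of X 50, sum of Y 5390, total sum of squares 9000, residual variance 40). It assumes, as is usual but not stated in the question, that the residual variance is SSE / (n - 2).

```python
n = 50
slope = 2.0
mean_x = 50.0
sum_y = 5390.0
sst = 9000.0
var_resid = 40.0  # assumed to be SSE / (n - 2)

mean_y = sum_y / n

# The least squares line passes through the means: b0 = mean_y - b1 * mean_x.
intercept = mean_y - slope * mean_x

sse = var_resid * (n - 2)  # error sum of squares
ssr = sst - sse            # regression sum of squares
r_squared = ssr / sst      # coefficient of determination

print(round(intercept, 2), sse, ssr, round(r_squared, 4))
```

The coefficient of determination is read as the share of the variation in sales that the advertisement variable explains in this sample.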
72
MIS 214 Midterm 2012/2015 Summer











5. (20 pt) An analyst wants to estimate the dependence of the quantity
demanded of a product (Y) on its price (X1) and the price of its
substitute (X2) using linear regression, based on a large sample of
data obtained from 50 weeks.
Fill in the missing parts in the following regression output (from a to
l: this is the letter l).
Do not report the – s but you may need their values.
Do not write on this table.

R-square: f
Adjusted R-square: g
Standard error of regression: h

             SS     d.f.   MS     F     p-value
Regression   a      c      d      e
Error        b      d      2.5
Total        400    e
73
MIS 214 Final 2013/2014 Spring





1 (20 pt) For the following four scenarios, each having two cases
denoted by I and II, draw scatter plots of X (explanatory variable)
and Y (dependent variable), also showing the population regression
model drawn as a line or curve. Use around 20-25
hypothetical points. Unless otherwise stated, the assumptions of least
squares hold. In I and II, the population slope and intercept are the
same.
a) In II, the variance of the error is higher than in I.
b) In II, the coefficient of determination is higher than in I.
c) In II, the spread of X is higher than in I.
d) In II, the variance of the error term increases with higher values
of X. In I, the error is homoscedastic.
74
Exercise








a) Suppose A → B and B → C are strong rules.
Does this imply that A → C is also a strong rule?
b) Suppose A → C and B → C are strong rules.
Does this imply that A AND B → C is also a strong rule?
c) Suppose A → B and A → C are strong rules.
Does this imply that A → B AND C is also a strong rule?
d) Suppose A → B AND C is a strong rule. Does this
imply that A → B and A → C are strong rules?
e) Suppose A AND B → C is a strong rule. Does this
imply that A → C and B → C are strong rules?
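
Questions of this kind can be explored by computing support and confidence on a small transaction database. The sketch below builds a hypothetical database and thresholds (both are assumptions chosen for illustration) in which A → B and B → C are strong but A → C is not, a counterexample for part a).

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def is_strong(transactions, lhs, rhs, min_sup, min_conf):
    """A rule is strong if it meets both the support and confidence thresholds."""
    sup = support(transactions, set(lhs) | set(rhs))
    return sup >= min_sup and confidence(transactions, lhs, rhs) >= min_conf

# Hypothetical database of 10 transactions.
transactions = (
    [{"A", "B"}] * 3
    + [{"B", "C"}] * 5
    + [{"C"}] * 2
)
MIN_SUP, MIN_CONF = 0.2, 0.6  # hypothetical thresholds

print(is_strong(transactions, {"A"}, {"B"}, MIN_SUP, MIN_CONF))
print(is_strong(transactions, {"B"}, {"C"}, MIN_SUP, MIN_CONF))
print(is_strong(transactions, {"A"}, {"C"}, MIN_SUP, MIN_CONF))
```

Here A and C never occur together, so A → C has zero support even though both of the other rules pass the thresholds. The same helper functions can be reused to search for counterexamples to the other parts.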
76
Exercise



a) Suppose {A,B,C} is a frequent 3-itemset.
Does it imply that {A,B} and {A,C} are frequent
2-itemsets?
b) Suppose {A,B}, {A,C}, and {B,C} are
frequent 2-itemsets. Does it imply that {A,B,C}
is a frequent 3-itemset?
c) Suppose {A,B} is a frequent 2-itemset. Does
it imply that A → B and B → A are strong
rules?
77
Associations
1. In a particular database, A → C and B → C are
strong association rules based on the support-confidence
measure. A and B are independent items. Does this
imply that A → BC is also a strong rule based on the lift
measure? A, B, C are items in a transaction
database.
2. If A → B and B → C are strong, is A → C a
strong rule?
3. If A → B and A → C are strong, is B → C a
strong rule?
78
MIS 542 midterm S06 association
constraint



The price of each item in a store is
nonnegative. For the following cases indicate
the type of constraint (such as: monotone,
antimonotone, tough, strongly convertible or
succinct).
a) Containing at least one Nintendo Game.
b) The average price of items is between
100 and 500.
79
BIS 541 2012/2013 Final II




4. Questions about constraint-based
association rule mining.
The price of each item is nonnegative. For the
following cases indicate the type of constraint
(monotonic, anti-monotonic or none).
a) The sum of the prices of the items is less than or
equal to 10.
b) The average price of the items is less than or
equal to 20.
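
A constraint is anti-monotonic if, whenever an itemset satisfies it, every nonempty subset does too (equivalently, once violated it stays violated for all supersets); it is monotonic if satisfaction carries over to every superset. The sketch below checks both properties by brute force on one hypothetical price table. Such a check can refute a property with a counterexample but cannot prove it in general; the item names and prices are assumptions.

```python
from itertools import combinations

PRICES = {"a": 30, "b": 5, "c": 40}  # hypothetical nonnegative prices

def itemsets(items):
    """All nonempty itemsets over `items`."""
    return [
        frozenset(c)
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
    ]

def sum_at_most_10(itemset):
    return sum(PRICES[i] for i in itemset) <= 10

def avg_at_most_20(itemset):
    return sum(PRICES[i] for i in itemset) / len(itemset) <= 20

def is_anti_monotone(constraint, items):
    """If T satisfies the constraint, every nonempty subset S of T must too."""
    sets = itemsets(items)
    return all(constraint(s) for t in sets if constraint(t) for s in sets if s <= t)

def is_monotone(constraint, items):
    """If S satisfies the constraint, every superset T of S must too."""
    sets = itemsets(items)
    return all(constraint(t) for s in sets if constraint(s) for t in sets if s <= t)

items = sorted(PRICES)
print(is_anti_monotone(sum_at_most_10, items), is_monotone(sum_at_most_10, items))
print(is_anti_monotone(avg_at_most_20, items), is_monotone(avg_at_most_20, items))
```

On this table the sum constraint behaves anti-monotonically, as expected for nonnegative prices, while the average constraint fails both checks: {a} violates it but {a, b} satisfies it, and {b} satisfies it but {b, c} does not.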
80
MIS 214 Final 2013/2015 Spring



(15 pt) Given that L4:
{(1,2,3,4),(2,4,5,6)}, where 1,2,...,6 are the IDs
of items.
a) Write an L3 consisting of five 3-itemsets
b) Write a C3 of seven 3-itemsets
81
BIS 541 2011/2012 Final





1. For each of the following problems, identify the relevant
data mining tasks.
a) A weather analyst is interested in calculating
the likely change in temperature for the coming days.
b) A marketing analyst is looking for groups of
customers so as to apply different CRM strategies for
each group.
c) A medical doctor must decide whether a set of
symptoms is an indication of a particular disease.
d) An educational psychologist would like to
identify exceptional students in order to suggest them for
special educational programs.
83
BIS 541 2011/2012 Final




2. Develop a data warehouse for an insurance company
using a fact constellation schema. The company holds
the insurance premiums paid by its customers for different
types of policies as well as the payments to its customers
in case of accidents. There are two fact tables,
for premiums and payments respectively. The
dimensions are customer, time, policy, and accident; some are
shared by the two fact tables.
a) Design the fact tables: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Show one roll-up and one drill-down operation.
84
BIS 541 2011/2012 Final




3. Consider a customer segmentation problem
to be solved with the k-means algorithm. The
following variables are available in the dataset:
gender, member card information, total
spending in TL, and education level.
a) What are the scales of these variables?
b) How would you transform the data before
applying clustering?
c) How do you find the similarity/dissimilarity
between two customers?
85
BIS 541 2011/2012 Final

4. Construct a particular node of a decision tree.
There are 6 data points at that node. The
output is a categorical variable with two distinct
values. Generate a data set of three variables,
one being the output (Y) and the others inputs
(X1 and X2), such that X1 yields as much
information gain as possible, whereas
X2 yields no information gain at all.
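One way to check a candidate data set for this question is to compute the information gain of each input directly. The sketch below is only an illustration: the six rows and the value labels ("a", "b", "p", "q") are made up, and entropy is measured in bits.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(x, y):
    """Entropy of y minus the weighted entropy of y after splitting on x."""
    n = len(y)
    remainder = 0.0
    for value in set(x):
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(y) - remainder

# Hypothetical node with 6 data points (assumed values, not from the exam).
y  = ["yes", "yes", "yes", "no", "no", "no"]
x1 = ["a", "a", "a", "b", "b", "b"]   # splits the classes perfectly
x2 = ["p", "p", "q", "p", "p", "q"]   # same class ratio in every branch

gain_x1 = information_gain(x1, y)
gain_x2 = information_gain(x2, y)
print(gain_x1, gain_x2)
```

With a balanced two-class output the largest possible gain is 1 bit, which X1 reaches here because each branch is pure; X2 leaves a 50/50 class mix in both branches, so its gain is zero.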
86
BIS 541 2011/2012 Final



1. Generate two different data sets of two
continuous input variables X1 and X2 for a
clustering problem:
a) one that would give almost the same
clustering results when solved by k-means and
k-medoids
b) one that would give a different set of clusters
when solved by k-means and k-medoids
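The two methods differ in how a cluster is represented: k-means uses the mean of the points, k-medoids uses an actual data point with the smallest total distance to the others. The following sketch, with made-up points, shows how a single outlier moves the mean but not the medoid, which is one situation where the two methods can disagree. It is not a full implementation of either algorithm.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Coordinate-wise mean, the cluster representative used by k-means."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def medoid(points):
    """Data point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Hypothetical cluster (assumed values) and the same cluster plus one outlier.
compact = [(1, 1), (1, 2), (2, 1), (2, 2)]
with_outlier = compact + [(20, 20)]

print(centroid(compact))       # mean of the compact cluster
print(centroid(with_outlier))  # mean is dragged toward the outlier
print(medoid(with_outlier))    # medoid stays inside the compact group
```

For part (a), data without outliers and with compact, well-separated groups would keep the mean and the medoid of each cluster close together.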
87
BIS 541 2011/2012 Final





2. Develop a data warehouse for holding the academic performance of
a university’s faculty members. The dimensions are time (here the
academic year is important, but the day of publication is too
detailed), faculty member, and paper. For an article published by a
faculty member in a particular paper, the number of citations
received and the impact factor of that paper are important. Papers
can be journal articles or conference proceedings; journals can be in
SCI or SSCI, and each such journal or conference has a prestige
factor, a continuous variable.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Describe in words five different types of queries that can be
answered by the OLAP cube.
d) Show two roll-up and two drill-down operations.
88
BIS 541 2011/2012 Final



3. Generate data sets for a supervised learning
problem solved by neural networks.
a) There are two continuous independent
variables X1 and X2 and a class variable with
two different values such as yes and no. On the
same artificially generated dataset, illustrate
the concept of overfitting by neural networks.
b) Illustrate the behavior of training and test
errors as the complexity of the network
increases.
89
BIS 541 2011/2012 Final



4. Consider a classification problem to be solved by the k-NN method. The output is whether the customer will
buy a product or not. The inputs are income, age,
education level of the customer, and profession of the
customer (here having distinct values).
a) Describe the data transformations needed in the
preprocessing step to prepare the data set to be
classified by k-NN.
b) How do the data transformations differ from those
needed to solve the same problem with neural networks?
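Because k-NN relies on distances, inputs on different scales usually need to be brought to a common range, and categorical inputs need a numeric encoding. The sketch below is one possible preprocessing, under stated assumptions: the three customers, the education categories and their ranking, and the profession names are all hypothetical; min-max scaling and one-hot encoding are common choices, not the only ones.

```python
# Assumed ordinal ranking of education levels (hypothetical categories).
EDUCATION_RANK = {"primary": 0, "secondary": 1, "university": 2}

def min_max(values):
    """Rescale numeric values to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """One binary column per distinct category (nominal variable)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def preprocess(rows):
    """Build a numeric feature vector per customer for distance computation."""
    income = min_max([r["income"] for r in rows])
    age = min_max([r["age"] for r in rows])
    education = min_max([EDUCATION_RANK[r["education"]] for r in rows])
    profession = one_hot([r["profession"] for r in rows])
    return [[i, a, e] + p
            for i, a, e, p in zip(income, age, education, profession)]

# Made-up customers for illustration only.
rows = [
    {"income": 2000, "age": 25, "education": "primary", "profession": "clerk"},
    {"income": 6000, "age": 45, "education": "university", "profession": "engineer"},
    {"income": 4000, "age": 35, "education": "secondary", "profession": "teacher"},
]
features = preprocess(rows)
print(features)
```

Treating education as ordinal (a rank, then rescaled) and profession as nominal (one-hot) keeps the distance between two customers from being dominated by income, which has by far the largest raw range.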
90
BIS 541 2012/2013 Final II





1. For each of the following problems, identify the relevant
data mining tasks with a brief explanation.
a) A weather analyst is interested in whether the
temperature will be up or down for the coming day.
b) An insurance analyst intends to group policy
holders according to characteristics of customers and
policies.
c) A medical researcher is looking for symptoms
that occur together among a large set of
patients.
d) An educational program director would like to
determine the likely GPA of applicants to an MA program from
their ALES scores, undergraduate GPAs, and entrance
exam scores.
91
BIS 541 2012/2013 Final II





2. Develop a data warehouse for a weather bureau
that has many probes located all over a large region,
using a star schema. These probes collect basic weather
data such as temperature, air pressure, humidity,… at
each hour. All the data is sent to a central station to be
processed.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by
querying the warehouse.
d) Show one roll-up and one drill-down operation about
one of these questions.
92
BIS 541 2012/2013 Final II





3. Evaluate the four classification methods:
decision trees, neural networks, Bayesian
classification, and k-NN in terms of
a) accuracy
b) speed of model development and use
c) understandability and interpretability of
output
d) handling of outliers if not handled in
the preprocessing step
93
BIS 541 2012/2013 Final II




4. This question is about constraint-based
association rule mining.
The price of each item is nonnegative. For the
following cases, indicate the type of constraint
(monotonic, anti-monotonic, or none):
a) the sum of the prices of the items is less than or
equal to 10
b) the average price of the items is less than or
equal to 20
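A brute-force check over a small item universe can help build intuition for these definitions. The sketch below enumerates every non-empty itemset over four items with made-up prices and tests both properties. A `False` result is a genuine counterexample; a `True` result only means no counterexample exists for this particular price list, so it supports but does not prove the general property.

```python
from itertools import combinations

# Hypothetical nonnegative item prices (assumed for illustration).
prices = {"a": 2, "b": 5, "c": 30, "d": 100}

def all_itemsets(items):
    """Every non-empty subset of the item universe, as frozensets."""
    items = sorted(items)
    return [frozenset(combo)
            for r in range(1, len(items) + 1)
            for combo in combinations(items, r)]

def sum_le_10(itemset):
    return sum(prices[i] for i in itemset) <= 10

def avg_le_20(itemset):
    return sum(prices[i] for i in itemset) / len(itemset) <= 20

def is_anti_monotone(constraint):
    """If an itemset satisfies the constraint, so does every proper subset."""
    sets = all_itemsets(prices)
    return all(constraint(sub)
               for s in sets if constraint(s)
               for sub in sets if sub < s)

def is_monotone(constraint):
    """If an itemset satisfies the constraint, so does every proper superset."""
    sets = all_itemsets(prices)
    return all(constraint(sup)
               for s in sets if constraint(s)
               for sup in sets if s < sup)

print("sum <= 10:", is_anti_monotone(sum_le_10), is_monotone(sum_le_10))
print("avg <= 20:", is_anti_monotone(avg_le_20), is_monotone(avg_le_20))
```

With nonnegative prices, adding an item can never decrease the sum, which is why the sum constraint behaves anti-monotonically; the average can move in either direction when an item is added, so neither property holds for it in this example.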
94
BIS 541 2012/2013 Final II









5. Based on a sample of 30 observations, the population regression
model
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
is estimated.
The least squares estimate of the intercept is 10.0.
The sums of the values of the dependent and independent variables are 450
and 150 respectively.
The estimated variance of the dependent variable is 25; the variance of the
residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret
the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not
have a significant effect on Y at the 95% confidence level. The critical
value is $F_{0.05}(1, 28) = 4.20$.
95
BIS 541 2013/2014 Final





1. For each of the following problems, identify the relevant
data mining tasks with a brief explanation.
a) A financial analyst is interested in whether the
stock market index will be up or down for the coming
day.
b) Cities in Turkey are grouped according to their voting
characteristics after the presidential election.
c) A security specialist is interested in determining whether mail
messages are spam or not by looking at the words appearing in the
messages.
d) A medical doctor is interested in what symptoms
(binary variables) occur together for a specific type of
cancer.
96
BIS 541 2013/2014 Final






2. Evaluate the four clustering methods: k-means, k-medoids, hierarchical, and model-based (probabilistic) in
terms of
a) handling of non-spherical shapes
b) speed of model development
c) understandability and interpretability of output
d) sensitivity to outliers.
For each of these aspects, mention only the remarkable
methods (you need not mention all methods for all
aspects).
97
BIS 541 2013/2014 Final

3. Develop a data warehouse for the election
of the president of the republic. There are
many poll stations (sandık) located all over
the country. Use a star schema. Each poll
station has valid votes for each of the three
candidates, invalid votes, and the total number of
voters. Each poll station has a set of location-
related variables such as district, city, and some
characteristics of cities. There is no time
dimension in this version of the problem.
98
BIS 541 2013/2014 Final




a) Design a warehouse with a star schema: fact
table (keys and measures) and at least two
calculated measures.
b) Design the dimension tables and their concept
hierarchies.
c) State two questions that can be answered by
querying the warehouse.
d) Show one roll-up and one drill-down
operation about one of these questions.
99
BIS 541 2013/2014 Final





4. Based on a sample of 50 observations, the population
regression model to predict the number of automobile sales
(dependent variable) based on advertisement
placements (independent variable),
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,
is estimated.
The least squares estimate of the slope is 2.0.
The average of the values of the independent variable is 50.
The sum of the values of the dependent variable is 5390.
The total sum of squares for the dependent variable is 9000.
The variance of the residuals is 40.
100
BIS 541 2013/2014 Final




a) What is the least squares estimate of the
intercept coefficient? Interpret the figure.
b) Interpret the slope coefficient.
c) What are the values of SSR and SSE?
d) Find and interpret the coefficient of
determination.
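A possible line of calculation for this question is sketched below. It assumes that the "variance of the residuals" is the mean squared error, $s_e^2 = SSE/(n-2)$; the question does not state the divisor.

```latex
\begin{align*}
\bar{y} &= \frac{5390}{50} = 107.8 \\
b_0 &= \bar{y} - b_1 \bar{x} = 107.8 - 2.0 \times 50 = 7.8 \\
SSE &= s_e^2 \,(n - 2) = 40 \times 48 = 1920 \\
SSR &= SST - SSE = 9000 - 1920 = 7080 \\
R^2 &= \frac{SSR}{SST} = \frac{7080}{9000} \approx 0.787
\end{align*}
```

Under that assumption, about 79% of the variation in automobile sales is explained by advertisement placements, and each additional placement is associated with two more predicted sales.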
101
Outline










Methodology - Overview
Introduction
Data Description – Preprocessing
OLAP
Clustering
Classification
Numerical Prediction - Regression
Frequent Pattern Mining
Recent BIS Exams
Unclassified Questions
102

5. (25 points) Consider a data set representing the
interactions among a set of people. The degree of
interaction is a positive real number; high values
mean that the two members are closely related
(they have close interactions such as heavy telephone
calls or mail traffic between them). In other words,
rather than including the coordinates of variables
directly, the similarity/dissimilarity matrix is given. This
is a symmetric matrix. Develop an algorithm for
clustering similar objects into the same clusters. Assume
that the number of clusters (k) is given.
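One possible answer, among several, is agglomerative clustering that works directly on the similarity matrix and stops when k clusters remain. The sketch below uses average linkage; the function name, the linkage choice, and the 4 x 4 example matrix are assumptions made for illustration. A k-medoids variant on the same matrix would be another reasonable design.

```python
def cluster_by_similarity(sim, k):
    """Average-link agglomerative clustering on a symmetric similarity matrix.

    Starts with one cluster per object and repeatedly merges the pair of
    clusters with the highest average pairwise similarity until k remain.
    """
    clusters = [[i] for i in range(len(sim))]

    def average_similarity(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None  # (similarity, index_p, index_q) of the best pair so far
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = average_similarity(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical interaction strengths among four people (diagonal unused).
sim = [
    [0, 9, 1, 2],
    [9, 0, 2, 1],
    [1, 2, 0, 8],
    [2, 1, 8, 0],
]
result = cluster_by_similarity(sim, 2)
print(result)
```

Only pairwise similarities are needed, so no coordinates have to be reconstructed; the cost is that every merge step scans all cluster pairs, which is acceptable for small data sets but slow for large ones.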
103





3. (25 points) Consider a data set of two continuous
variables X and Y. X is right skewed and Y is left
skewed. Both represent measures of the same quantity
(sales categories, exam grades,…).
a) Draw typical distributions of X and Y separately.
b) Draw box plots of X and Y separately.
c) Draw quantile plots (q-plots) of X and Y separately.
d) Draw the q-q plot of X and Y.
104

4. (25 points) A strategy for clustering high-
dimensional data of continuous variables is:
first apply principal components analysis to reduce the
dimensionality of the data set, then apply
clustering to the reduced form of the data.
Discuss the drawback(s) of this approach.
105
MIS 541 2012/2013 Final



1. (20 pts) Consider a data set of two
continuous variables X and Y. Both have the
same mean and neither is skewed
(both are symmetric). X has a higher variance than Y.
Both represent measures of the same quantity
(sales categories, exam grades,…).
a) Draw typical distributions of X and Y on
the same graph.
b) Draw box plots of X and Y separately.
106
MIS 541 2012/2013 Final

2. (20 pts) Illustrate, with plots of two
continuous inputs and a binary class, that
neural networks with one hidden layer are enough to classify
convex class boundaries, whereas two hidden layers are
enough to capture even non-convex class
boundaries.
107
MIS 541 2012/2013 Final

3. (20 pts) Consider association rules X Y
where X is a categorical variable with more
then two values and Y is originally continuous
but discretize into categories. Give example
variables for X and Y. Illustrate that confidence
as an interestingness measure may be
misleading. Suggest a modification to the
classical confidence so as to eliminate its
drawback for this type of variables.
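A small numeric illustration of why confidence alone can mislead is sketched below. The counts are made up, as are the example variables (X = education level "university", Y = income category "medium"). Lift, which divides confidence by the overall frequency of the consequent, is one commonly used correction; it is offered here as one possible answer, not necessarily the modification the question intends.

```python
def confidence(n_xy, n_x):
    """Estimated P(Y | X): share of X transactions that also contain Y."""
    return n_xy / n_x

def lift(n_xy, n_x, n_y, n):
    """Confidence divided by the baseline frequency P(Y)."""
    return (n_xy / n_x) / (n_y / n)

# Hypothetical counts: 1000 customers in total.
n = 1000      # all customers
n_x = 200     # education level = university
n_y = 800     # income category = medium
n_xy = 150    # both

rule_confidence = confidence(n_xy, n_x)
rule_lift = lift(n_xy, n_x, n_y, n)
print(rule_confidence, rule_lift)
```

The rule looks strong with 75% confidence, yet 80% of all customers already fall in the consequent category, so the lift is below 1 and knowing X actually lowers the chance of Y. When Y is discretized into categories of very different sizes, the largest category tends to produce high confidence for almost any antecedent.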
108
MIS 541 2012/2013 Final



4. (20 pts) The price of each item is
nonnegative. For the following cases,
indicate the type of constraint
(monotone, anti-monotone, tough,
strongly convertible, or succinct):
a) the sum of the prices of the items is less than
or equal to 10
b) the average price of the items is less than
or equal to 20
109
Midterm 2008/2009 Spring

2. (20) Consider a classification problem in which
customers who take consumer credit from a
bank are classified into three risk groups. The input
variables are age (discretized into 4 groups), income (4
groups), education (4 groups), gender, the number
of months the customer has been dealing with the bank,
the average delay of payments in months, and the current
value of the account balance. The output variable has 3
categories (risky, normal, or highly risky), calculated by
some procedure and provided to the data miner.
Design an encoding scheme for the input and output
variables so that the problem can be solved by a neural
network. Show a typical topology of a feedforward
network architecture.
110
Midterm 2008/2009 Spring

3. (20 points) Consider a classification problem solved by a
decision tree. There are two
categorical input variables A and B having two
distinct values each. The output variable C has
two distinct classes. Suppose the dataset is
suitable for using decision trees. Does the order
of selection of variables affect the
classification error? Support your answer by
generating data sets pictorially. (The stopping
condition is that either a pure class is obtained or no
variables remain to be tested.)
111
Midterm 2008/2009 Spring






4. (20 points) Principal components analysis (PCA) is used for dimensionality
reduction and may then be followed by cluster analysis, say for
segmentation purposes. Consider a problem with two continuous
variables. Using scatter plots:
a) Generate a data set where PCA reduces the dimensionality
from two to one.
b) Generate a data set where, although there is a relation between
the two variables, PCA
is not able to reduce the dimensionality to one.
c) Generate a data set where there are natural clusters and PCA
can reduce the dimensionality.
d) Generate a data set where there are natural clusters but PCA is
not the appropriate method for reducing the dimensionality.
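Candidate data sets for parts (a) and (b) can be checked by computing how much of the total variance the first principal component explains. The sketch below does this for two variables using the closed-form eigenvalues of the 2 x 2 covariance matrix. The two example data sets (points on a straight line, points on a circle) are assumptions chosen for illustration.

```python
from math import cos, sin, pi, sqrt

def first_pc_share(points):
    """Share of total variance explained by the first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] in closed form.
    largest = (sxx + syy + sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2
    return largest / (sxx + syy)

# (a) Linear relation: one component carries all of the variance.
line = [(x, 2 * x) for x in range(10)]

# (b) Nonlinear relation (x^2 + y^2 = 1): the variables are related,
# but the covariance is zero, so no single direction dominates.
circle = [(cos(2 * pi * t / 12), sin(2 * pi * t / 12)) for t in range(12)]

line_share = first_pc_share(line)
circle_share = first_pc_share(circle)
print(line_share, circle_share)
```

PCA only captures linear structure: the line gives a share of 1, while the circle gives 0.5 even though the two variables are perfectly related. The same check applied to clustered data can show whether the direction of largest variance is also the direction that separates the clusters, which is the issue behind parts (c) and (d).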
112