NEHRU ARTS AND SCIENCE COLLEGE
T.M PALAYAM, COIMBATORE
PG & RESEARCH DEPARTMENT OF COMPUTER SCIENCE
QUESTION BANK
CLASS: III B. Sc (CS)
SUBJECT NAME: DATA MINING
UNIT-1
SECTION-A
ONE MARKS:
1. The Data accessed is usually a different version from that of the original
operational database.
a) Query
b) Data
c) Output
d) Model
2. The Output of the data mining query probably is not a subset of the database.
a) Query
b) Data
c) Output
d) Model
3. A Predictive Model makes a prediction about values of data using known
results found from different data.
a) Predictive model
b) Descriptive model c) Both a & b
d) None
4. A Descriptive model identifies patterns or relationships in data.
a) Predictive model
b) Descriptive model c) Both a & b
d) None
5. Classification maps data into predefined groups or classes.
a) Classification
b) Regression
c) Prediction
d) Time series
6. Regression is used to map a data item to a real-valued prediction variable.
a) Classification
b) Regression
c) Prediction
d) Time series
7. Clustering is similar to classification except that the groups are not predefined.
a) Regression
b) Clustering
c) Association
d) Summarization
8. Summarization maps data into subsets with associated simple descriptions.
a) Query
b) Model
c) Summarization
d) Association
9. Link analysis is alternatively referred to as Affinity analysis.
a) Clustering
b) Prediction
c) Link analysis d) None
10. Sequential analysis is also known as Sequence discovery.
a) Selection
b) Sequence analysis c) Preprocessing d) Data mining
11. Both a & b is used to determine sequential patterns in data.
a) Sequence analysis
b) Sequence discovery c) Both a & b d) None
12. KDD stands for Knowledge Discovery in Databases.
a) Knowledge Discovery in Databases
b) Knowledge Detection in Databases
c) Knowledge Discovery in Data mining
d) Knowledge Domain in Databases
13. KDD is the process of finding useful information and patterns in data.
a) CAD
b) DTD
c) KDD
d) CD
14. The KDD consists of 5 steps.
a) 3
b) 4
c) 5
d) 6
15. The step in which the data needed for the data mining process are obtained
from many different & heterogeneous data sources is Selection.
a) Transformation
b) Data mining
c) Selection
d) Evaluation
16. The step in which data from different sources are converted into a common
format for processing is Transformation.
a) Transformation
b) Data mining
c) Selection
d) Evaluation
17. The step that handles incorrect or missing data in the data to be used by the process is Pre-processing.
a) Transformation
b) Data mining
c) Selection
d) Pre-processing
18. Visualization refers to the visual presentation of data.
a) Graphical
b) Icon based c) Visualization
d) Pixel based
19. Geometric techniques include the box plot and scatter diagram techniques.
a) Graphical
b) Icon based c) Visualization
d) Geometric
20. Attributes in the database that are not of interest to the data mining task
being developed are Irrelevant data.
a) Missing data
b) Irrelevant data
c) Multimedia data d) None
21. A conventional database schema composed of many different attributes
has High dimensionality.
a) High dimensionality b) Low dimensionality c) Medium Dimensionality
d) All of these
22. Outliers are the data entries, often many, that do not fit nicely into the derived model.
a) Large dataset
b) Outliers
c) Selection
d) Application
23. A large database can be viewed as using Approximation.
a) Large dataset
b) Outliers
c) Selection
d) Approximation
24. In segmentation, a database is partitioned into disjoint groupings of similar
tuples called Segments.
a) Segments
b) Association c) Dimensional
d) Outliers
25. Data mining consists of 3 parts.
a) 3 b) 4 c) 5 d) 6
SECTION-B
5 MARKS:
1. Write a short note on Data mining Vs Knowledge discovery in databases.
2. Write a short note on Development of Data mining.
3. Write a short note on Summarization.
4. Write a short note on Sequence Discovery.
5. Write a short note on Social implications of data mining.
SECTION-C
8 MARKS:
1. Explain in detail about Data mining from a database perspective.
2. Explain in detail about,
i) Classification
ii) Regression
iii) Time series analysis
3. Explain in detail about,
i) Prediction
ii) Clustering
iii) Association Rules
4. Explain in detail about Data mining Issues.
5. Explain in detail about Data mining Metrics.
UNIT-2
SECTION-A
ONE MARKS:
1. A Parametric model describes the relationship between input & output through the
use of algebraic equations.
a) Parametric model b) Non-parametric model c) Both a & b d) None
2. The squared error is often examined for a specific prediction to measure
accuracy rather than to look at the average difference.
a) RMS b) Squared error c) Unbiased
d) Biased
3. RMS stands for Root Mean Square.
a) Root Mean Square
b) Root Median Square c) Range Mean Square
d) Range Median Square
4. The RMS may also be used to estimate error or as another statistic to describe a
distribution.
a) RMS b) Squared error c) Unbiased
d) Biased
5. Point estimation refers to the process of estimating a population parameter.
a) Parametric model
b) Non-parametric model c) Both a & b
d) Point estimation
6. MLE stands for Maximum Likelihood Estimate.
a) Maximum Likelihood Estimate b) Maximum Likelihood Effort c) Maximum
Likelihood Error d) Maximum Likelihood Extent
7. Expectation Maximization algorithm is an approach that solves the estimation
problem with incomplete data.
a) RMS b) Squared error c) Unbiased
d) Expectation Maximization
8. Frequency Distribution provides an even better model of data.
a) Histogram b) Frequency distribution c) Both a & b d) None
9. Hypothesis testing attempts to find a model that explains the observed data by
first creating a hypothesis.
a) Alternative hypothesis b) Hypothesis testing c) Both a & b d) None
10. Correlation can be used to evaluate the strength of a relationship between two
variables.
a) Linear b) Correlation c) Hypothesis
d) RMS
11. Linear regression assumes that a linear relationship exists between the input
data and the output data.
a) Linear regression b) Correlation c) Hypothesis
d) RMS
12. A Decision tree is a predictive modeling technique used in classification tasks.
a) Decision tree
b) Correlation
c) Input database
d) Binary search
13. A Decision tree is a tree where the root and each internal node is labeled with
a question.
a) Input tree
b) Output tree
c) Decision tree
d) All of these
14. Decision tree consists of 3 parts.
a) 2
b) 3 c) 4
d) 5
15. Neural networks are also known as Artificial Neural Networks.
a) Artificial Neural Networks
b) Artificial Neural data
c) Artificial Network data
d) Artificial Neural interface
16. ANN stands for Artificial Neural Networks.
a) Artificial Neural Networks
b) Artificial Neural data
c) Artificial Network data
d) Artificial Neural interface
17. A neural network consists of 3 parts.
a) 2 b) 3 c) 4 d) 5
18. An activation function may also be known as a Firing rule.
a) Firing rule b) Threshold c) Linear d) All of these
19. An activation function is sometimes called a Both a & b.
a) Processing element function b) Squashing function c) Both a& b d) None
20. The linear threshold function is also called a Both a & b.
a) Ramp function b) Piecewise function c) Both a & b d) None
21. Genetic algorithms are examples of evolutionary computing methods and are
optimization-type algorithms.
a) Gaussian law
b) Genetic algorithm c) Hyperbolic tangent d) None
22. A Genetic algorithm is a computational model consisting of 5 parts.
a) 3 b) 4 c) 5 d) 6
23. The precise algorithm that indicates how to combine the given set of
individuals to produce new ones is the Crossover algorithm.
a) Crossover algorithm
b) Genetic algorithm
c) Hyperbolic tangent
d) None
24. A Linear activation function produces a linear output value based on the input.
a) Linear
b) Threshold
c) Activation
d) Genetic algorithm
25. A neural network consists of 2 parts.
a) 2 b) 3
c) 4
d) 5
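Questions 2 to 4 above refer to the squared error and the RMS. The following is a minimal Python sketch of both measures, assuming the usual textbook definitions (squared difference for a single prediction, square root of the mean of squared values for the RMS). The function names are illustrative and do not come from the course text.

```python
import math


def squared_error(predicted, actual):
    """Squared error for one specific prediction."""
    return (predicted - actual) ** 2


def root_mean_square(values):
    """Square root of the mean of the squared values."""
    values = list(values)
    if not values:
        raise ValueError("root_mean_square needs at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_error(predicted, actual):
    """RMS of the differences between paired predictions and actual values."""
    differences = [p - a for p, a in zip(predicted, actual)]
    return root_mean_square(differences)
```

For example, `squared_error(5, 3)` evaluates to 4, and `rms_error` returns 0.0 when every prediction matches its actual value.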
SECTION-B
5 MARKS:
1. Write a short note on Point estimation.
2. Write a short note on Models based on summarization.
3. Write a short note on Bayes Theorem.
4. Write a short note on Hypothesis Testing.
5. Write a short note on Regression & Correlation.
SECTION-C
8 MARKS:
1. Explain in detail about Similarity measures.
2. Explain in detail about Decision trees.
3. Explain in detail about neural networks.
4. Explain in detail about Activation functions.
5. Explain in detail about Genetic algorithms.
UNIT-3
SECTION-A
ONE MARKS:
1. Regression problems deal with estimation of an output value based on input
values.
a) Classification b) Data Mining c) Regression d) Statistical
2. ROC Stands for Both a & b.
a) Relative Operating Characteristic b) Receiver Operating Characteristic
c) Both a & b
d) None
3. KNN Stands for K Nearest Neighbors.
a) K nearest Neighbors b) K Notification Neighbors c) K Notation Neighbors
d) None
4. CART is a technique that generates a binary decision tree.
a) KNN b) CART c) ROC
d) RRC
5. RBF Stands for Both a & b.
a) Radial Function b) Radial Basis Function c) Both a & b d) None
6. RBF is a class of functions whose value decreases with the distance from a
central point.
a) RBF b) KNN c) CART d) ROC
7. A Perceptron is a single neuron with multiple inputs & one output.
a) Perceptron
b) Rule based algorithm c) Generating Rules d) None
8. Multiple Independent approaches can be applied to a classification problem.
a) Multiple Dependent b) Multiple Independent c) Both a & b d) None
9. DCS Stands for Dynamic Classifier Selection.
a) Data Classifier Selection
b) Date Class Selection
c) Dynamic Classifier Selection
d) Dynamic Class Selection.
10. AVC Stands for Attribute Value Class.
a) Attribute Value Class
b) Attribute Virtual Class
c) Attribute Virtual Collections
d) Attribute Value Collections.
11. CART Stands for Classification & Regression Trees.
a) Class & Regression Trees
b) Classification & Regression Trees
c) Class & Rotational Trees
d) Classification & Rotational Trees
12. A Subtree is replaced by a leaf node if this replacement results in an error rate
close to that of the original tree.
a) Selection Tree
b) Sub Tree
c) Regression Tree d) None
13. The ID3 technique for building a decision tree is based on information theory &
attempts to minimize the expected number of comparisons.
a) ID2
b) ID3
c) Both a & b
d) None
14. A tuple is classified based on the region into which it falls.
a) Tuple
b) Decision Tree c) Sub Tree d) Classification
15. The approach in which the data are divided into regions based on class is Division.
a) Division b) Prediction
c) tuple
d) Tree
16. The approach in which formulas are generated to predict the output class value is Prediction.
a) Division b) Prediction
c) tuple
d) Tree
17. Classification accuracy is usually calculated by determining the percentage of
tuples placed in the correct class.
a) Classification
b) Division
c) Trees
d) Prediction
18. Missing Data values cause problems during both the training phase & the
classification process.
a) Decision tree
b) Missing data
c) Classification tree
d) prediction tree
19. Missing Data in the training data must be handled & may produce an
inaccurate result.
a) Decision tree
b) Missing data
c) Classification tree
d) prediction tree
20. There are 3 methods used to solve the classification problem.
a) 2 b) 3
c) 4 d) 5
21. The Logistic curve gives a value between 0 & 1 so it can be interpreted as the
probability of class membership.
a) Plain curve
b) Logistic curve
c) Linear curve
d) Non-linear curve
22. Regression can be used to perform classification using 2 approaches.
a) 2 b) 3
c) 4
d) 5
23. The common classification scheme based on the use of distance measures is
KNN.
a) KNN
b) CART
c) SRT
d) ROC
24. Solving the classification problem using decision trees is a 2-step process.
a) 2
b) 3
c) 4
d) 5
25. Pruning removes redundant comparisons or removes subtrees to achieve better
performance.
a) Pruning
b) KNN
c) Training tree
d) Decision tree
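Question 23 above names KNN as the common classification scheme based on distance measures. The following is a minimal Python sketch of that idea, assuming Euclidean distance and a simple majority vote among the K closest training tuples. The function name `knn_classify` and the data layout are illustrative assumptions, not taken from the course text.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training tuples.

    `training` is a list of (point, label) pairs, where each point is a
    tuple of numbers with the same length as `query`.
    """
    if k < 1 or k > len(training):
        raise ValueError("k must be between 1 and the number of training tuples")
    # Sort training tuples by Euclidean distance to the query and keep k.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    # The most frequent label among the k neighbors is the predicted class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A tuple placed near one group of training points is assigned that group's label. The choice of K and of the distance measure are both assumptions that affect the result.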
SECTION-B
5 MARKS:
1. Write a short note on Issues in classification.
2. Write a short note on Regression.
3. Write a short note on Bayesian classification.
4. Write a short note on Simple approach.
5. Write a short note on K Nearest neighbors
SECTION-C
8 MARKS:
1. Explain in detail about Decision tree based algorithm.
2. Explain in detail about,
i) ID3
ii) C4.5
3. Explain in detail about Neural Network based algorithms.
4. Explain in detail about,
i) CART
ii) Scalable DT techniques
5. Explain in detail about Rule based Algorithm.
UNIT-4
SECTION-A
ONE MARKS:
1. Clustering is similar to classification in that data are grouped.
a) Records b) Clustering
c) Grouping d) Database Segmentation
2. Dynamic data in the database implies that cluster membership may change
over time.
a) Static data
b) Dynamic data
c) Both a & b d) None
3. Outliers are sample points with values much different from those of the
remaining set of data.
a) Outliers
b) Hierarchical data c) Static data d) Dynamic data
4. Outlier detection is also known as Outlier Mining.
a) Outlier Method b) Outlier Mining c) Outlier Methodology d) None
5. Outlier detection is the process of identifying outliers in a set of data.
a) Outlier Method b) Outlier Mining c) Outlier Methodology d) None
6. Agglomerative algorithms start with each individual item in its own cluster &
iteratively merge clusters.
a) Agglomerative Algorithm
b) Divisive Algorithm
c) Partitional Algorithm
d) Clustering Algorithm
7. Single link technique is based on the idea of finding maximal connected
components in a graph.
a) Single link
b) Multi link
c) Scatter link
d) Partition link
8. MST Stands for Minimum Spanning Tree.
a) Minimum Spanning Tree
b) Minimum Spanning Task
c) Minimum Spanning Technique
d) Minimum Spanning Tendency
9. A Clique is a maximal graph in which there is an edge between any two
vertices.
a) Clique b) Outliers
c) Mining
d) Spanning tree
10. Partitional clustering creates the cluster in one step as opposed to several
steps.
a) Agglomerative Algorithm
b) Divisive Algorithm
c) Partitional Algorithm
d) Clustering Algorithm
11. K-Means clustering is an iterative clustering algorithm in which items are
moved among sets of clusters.
a) K-Means clustering
b) Agglomerative Algorithm
c) Divisive Algorithm
d) Partitional Algorithm
12. PAM Stands for Partitioning Around Medoids.
a) Problem Around Medoids
b) Problem Associate Medoids
c) Partitioning Around Medoids
d) Partitioning Around Methods
13. PAM is also known as the K-Medoids algorithm.
a) Problem Around Medoids
b) Problem Associate Medoids
c) Partitioning Around Medoids
d) K-Medoids Algorithm
14. BEA Stands for Bond Energy Algorithm.
a) Bond energy algorithm b) Bond Estimate algorithm
c) Both a & b
d) None
15. Neural Networks that use unsupervised learning attempt to find features in the data
that characterize the desired output.
a) Neural network b) Bond energy network c) Organizing map d) None
16. SOFM Stands for Self Organizing Feature Maps.
a) Self Organizing Feature Maps b) Self Orient Feature Method
c) Self Orient Feature Mapping d) Service Orient Feature Maps
17. BIRCH is designed for clustering a large amount of metric data.
a) SOM b) SOFM
c) BIRCH d) BEA
18. The approach used by DBSCAN is to create clusters with a minimum size & density.
a) DBSCAN
b) SOFM
c) BIRCH d) BEA
19. The CURE algorithm is designed to handle outliers.
a) CURE
b) SOFM
c) BIRCH d) BEA
20. ROCK algorithm is divided into 3 parts.
a) 2 b) 3 c) 4 d) 5
21. ROCK is targeted to both Boolean data & categorical data.
a) ROCK b) SOFM
c) BIRCH d) BEA
22. SOFM is also known as SOM.
a) BIRCH b) BEA
c) SOM d) ROCK
23. Partitional algorithm is also known as Non-hierarchical algorithm.
a) Hierarchical
b) Non-hierarchical c) Both a & b d) None
24. Clustering can be divided into 4 algorithms.
a) 2 b) 3 c) 4 d) 5
25. Hierarchical algorithms can be divided into 2 algorithms.
a) 2 b) 3 c) 4 d) 5
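Question 11 above describes K-Means as an iterative algorithm in which items are moved among clusters. The following is a minimal Python sketch of that loop, assuming Euclidean distance and caller-supplied initial centroids so that the run is deterministic. The function name `k_means` and its parameters are illustrative assumptions.

```python
import math


def k_means(points, centroids, max_iters=100):
    """Assign points to the nearest centroid, then recompute centroids as means.

    Stops when the centroids no longer change or after `max_iters` passes.
    Returns (centroids, clusters); if the iteration limit is hit first, the
    clusters reflect the assignment made before the final centroid update.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each point moves to the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean = tuple(sum(p[d] for p in cluster) / len(cluster)
                             for d in range(len(old)))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

In practice the initial centroids are often chosen at random, and different starting points can lead to different final clusters.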
SECTION-B
5 MARKS:
1. Write a short note on Similarity & Distance measures.
2. Write a short note on Outliers.
3. Write a short note on Bond energy algorithm.
4. Write a short note on nearest neighbor algorithm.
5. Write a short note on Minimum spanning tree.
SECTION-C
8 MARKS:
1. Explain in detail about squared error clustering algorithm.
2. Explain in detail about K-means clustering.
3. Explain in detail about PAM Algorithm.
4. Explain in detail about Hierarchical Algorithms.
5. Explain in detail about page clustering with genetic algorithms.
UNIT-5
SECTION-A
ONE MARKS:
1. Association rules are frequently used by retail stores to assist in marketing.
a) Association rule b) Large item set c) Apriori
d) None
2. A Large item set is an item set whose number of occurrences is above a threshold.
a) Association rule b) Large item set c) Apriori
d) None
3. The Apriori algorithm is the most well-known association rule algorithm and is used in
commercial products.
a) Apriori
b) Association c) Large item set d) None
4. Partitioning algorithm is able to adapt better to limited main memory.
a) Apriori
b) Association c) Large item set d) Partitioning algorithm
5. Data parallelism is also known as Task parallelism.
a) Task parallelism b) Increment parallelism c) Both a & b d) None
6. When a distributed association rule algorithm strives to parallelize the data, this is known
as Data parallelism.
a) Task parallelism b) Increment parallelism c) Data parallelism d) None
7. CDA Stands for Count Distribution algorithm.
a) Count Distribution algorithm b) Count Distribution attributes
c) Cost Distribution algorithm d) Cost Distribution attributes
8. In Task parallelism, the candidates are partitioned & counted separately at each
processor.
a) Task parallelism b) Increment parallelism c) Both a & b d) None
9. DDA Stands for Data Distribution Algorithm.
a) Data Distribution Algorithm
b) Domain Distribution Algorithm
c) Data Distribution Attribute
d) Digital Distribution Algorithm
10. The Data Distribution Algorithm demonstrates task parallelism.
a) Data Distribution Algorithm
b) Domain Distribution Algorithm
c) Data Distribution Attribute
d) Digital Distribution Algorithm
11. The factor under which the investigation has been limited to the use of association rules for
market basket data is Data Source.
a) Task Source b) Data Source c) Both a & b d) None
12. The most common data structure used to store the candidate item sets & their
counts is Hash tree.
a) Task Source b) Data Source c) Hash tree d) All of these
13. Hash trees provide an efficient technique to store, access, & count item sets.
a) Task source b) Hash Tree c) Data source d) None
14. Incremental updating approaches have addressed the issue of how to modify
the association rules as updates are performed on the database.
a) Single level b) Multi level c) Incremental updating d) Apriori
15. A variation of generalized rules is Multiple-level association rules.
a) Single level b) Multiple level c) Incremental updating d) Apriori
16. A quantitative association rule is one that involves Both a & b.
a) categorical b) Quantitative data c) Both a & b d) None
17. A Correlation rule is defined as a set of item sets that are correlated.
a) Correlation rules b) Correlation task c) Correlation task d) Data Mining
18. The problem in the multiple minimum supports is Rare item problem.
a) Rare item b) Data item c) Categorical d) All of these.
19. With Multiple-level association rules, the item sets may occur from any level in the hierarchy.
a) Single level b) Multiple level c) Incremental updating d) Apriori
20. The Sampling algorithm facilitates counting of item sets with large databases.
a) Task source b) Hash Tree c) Data source d) Sampling
21. An algorithm called Apriori-gen is used to generate the candidate item sets for each
pass.
a) Apriori-gen
b) Association c) Large item set d) Partitioning algorithm
22. Large item sets are also said to be Downward closed.
a) Upward closed b) Downward closed c) Both a & b d) None
23. Finding large item sets is quite easy, but the cost is high.
a) High b) Low c) Very high d) Very low
24. A database in which an association rule is to be found is viewed as a set of
tuples.
a) Tuples b) Date item set c) Apriori d) All of these
25. The basic idea of the Apriori algorithm is to generate candidate item sets of a particular size & then
scan the database to count these to see if they are large.
a) Apriori b) Association c) Partitioning d) None
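Questions 21 and 22 above mention candidate generation with Apriori-gen and the downward closed property of large item sets. The following is a minimal Python sketch of one common formulation (join two large item sets, then prune any candidate that has a subset which is not large). It is not necessarily the exact version given in the course text, and the function name and data layout are illustrative assumptions.

```python
from itertools import combinations


def apriori_gen(large_itemsets, k):
    """Generate size-k candidates from the large item sets of size k-1.

    Join step: the union of two large (k-1)-item sets is kept if it has
    exactly k items.
    Prune step: by downward closure, a candidate can only be large if every
    one of its (k-1)-item subsets is large, so other candidates are dropped.
    """
    large = {frozenset(s) for s in large_itemsets}
    candidates = set()
    for a in large:
        for b in large:
            union = a | b
            if len(union) != k:
                continue
            if all(frozenset(sub) in large
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates
```

Each surviving candidate still has to be counted against the database to decide whether it is actually large.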
SECTION-B
5 MARKS:
1. Write a short note on Large Item sets.
2. Write a short note on Incremental Rules.
3. Write a short note on measuring the quality of rules.
4. Write a short note on Comparing Approaches.
5. Write a short note on Apriori algorithm.
SECTION-C
8 MARKS:
1. Explain in detail about Basic Algorithm.
2. Explain in detail about Parallel & Distributed Algorithm.
3. Explain in detail about Generalized Association Rules.
4. Explain in detail about Quantitative Association Rules.
5. Explain in detail about Multiple-level Association Rules.