```
Master(Science) 2005:
(1a)
Explain the meaning of data mining.
Data mining (knowledge discovery from data):
extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data.
(1b)
Describe briefly the KDD (Knowledge Discovery in Databases) process.
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
(1c)
List 6 sample methods commonly used in data mining.
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
3b.
Briefly outline the major steps of decision tree classification.
Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left
4a.
The k-Means algorithm relies on iterating between two steps. List these two steps succinctly.
- Assign each data point to the closest cluster centre (centroid). That data point is now a member of that cluster.
- Calculate the new cluster centre (the arithmetic mean of all the members of that cluster).
4b.
Name two computational limitations of the k-Means clustering algorithm.
- Applicable only when a mean is defined, which is a problem for categorical data
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
5b.
The association rule partition (not Apriori) method divides the data set to mine into p sets (D1 … Dp),
applying a modified Apriori algorithm to each. What frequent itemset property does this algorithm
exploit?
Any subset of a frequent itemset must be frequent
3a.
What is the difference between discrimination and classification? Between
characterization and clustering? Between classification and prediction? For each of these pairs
of tasks, how are they similar?
4c.
Consider applying k-Means to a dataset that consists only of binary variables. How could you
calculate distances between a centroid and an instance?
4d.
What is the objective function of the k-Means algorithm?
4e.
Name one way that the clusters found by the agglomerative clustering algorithm differ from those
found by the k-Means clustering algorithm.
5a.
A fundamental assumption of basic classification algorithms is that the training and test set data
distributions are stationary. What does this mean?
5c.
For a given dataset where minimum support is and minimum confidence is an association
rule algorithm finds the association rules A → B and B → C. Write these association rules as
bounded conditional and joint probabilities.
Master(Science) 2006:
Question 2:
1. What is meant by an ‘outlier’? Of the following set of values, which is the outlier?
{0, 0.2, 0.5, 0.6, −0.1, 42, 0.67}
Outlier: a data object that does not comply with the general behavior of the data.
The outlier value is 42.
2. What is the purpose of a ‘test set’? Give one advantage and one disadvantage of using a large test set.
A test set is used to estimate the accuracy of the classification rules.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
3. What is the difference between ‘supervised’ and ‘unsupervised’ learning? Name an unsupervised learning
algorithm and give an example of how it can be used in practical applications.
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
4. Explain the difference between ‘nominal’ and ‘continuous’ attributes. Give TWO examples of each type of
attribute.
Nominal Variables: A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green
Continuous Ordinal Variables:
e.g., gold, silver, bronze
```
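The two k-Means steps listed under question 4a (2005) can be sketched in Python as follows. This is only an illustrative sketch, not part of the exam answers: the function name `kmeans`, the parameters `init_centroids` and `max_iter`, the use of Euclidean distance, the choice to pass the starting centroids in explicitly, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, init_centroids, max_iter=100):
    """Illustrative k-Means: alternate assignment and centroid update."""
    points = np.asarray(points, dtype=float)
    # Assumption: the caller supplies the k starting centroids.
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its current members.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = kmeans(data, init_centroids=[[0, 0], [10, 10]])
print(labels.tolist())  # [0, 0, 0, 1, 1, 1]
```

Passing the starting centroids explicitly keeps the sketch deterministic. The need to fix the number of clusters in advance (one of the limitations listed under 4b) shows up here as the length of `init_centroids`.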