Master (Science) 2005:

(1a) Explain the meaning of data mining.
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

(1b) Describe briefly the KDD (Knowledge Discovery in Databases) process.
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

(1c) List 6 sample methods commonly used in data mining.
Mining methodology:
- Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
- Performance: efficiency, effectiveness, and scalability
- Pattern evaluation: the interestingness problem
- Incorporation of background knowledge
- Handling noise and incomplete data
- Parallel, distributed and incremental mining methods
- Integration of the discovered knowledge with existing knowledge: knowledge fusion

3b. Briefly outline the major steps of decision tree classification.
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left

4a. The k-Means algorithm relies on iterating between two steps. List these two steps succinctly.
1. Assign each data point to the closest cluster centre (centroid). That data point is now a member of that cluster.
2. Calculate the new cluster centre (the geometric average of all the members of a given cluster).

4b. Name two computational limitations of the k-Means clustering algorithm.
- Applicable only when the mean is defined; then what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

5b. The association rule partition (not Apriori) method divides the data set to mine into p sets (D1 … DP), applying a modified Apriori algorithm to each. What frequent itemset property does this algorithm exploit?
Any subset of a frequent itemset must be frequent.

Not answered:

3a. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks, how are they similar?

4c. Consider applying k-Means to a dataset that consists only of binary variables. How could you calculate distances between a centroid and an instance?

4d. What is the objective function of the k-Means algorithm?

4e. Name one way that the clusters found by the agglomerative clustering algorithm differ from those found by the k-Means clustering algorithm.

5a. A fundamental assumption of basic classification algorithms is that the training and test set data distributions are stationary. What does this mean?

5c. For a given dataset where minimum support is and minimum confidence is an association rule algorithm finds the association rule AB and BC. Write these association rules as bounded conditional and joint probabilities.

Master (Science) 2006:

Question 2:

1. What is meant by an ‘outlier’? Of the following set of values, which is the outlier? {0, 0.2, 0.5, 0.6, −0.1, 42, 0.67}
Outlier: a data object that does not comply with the general behavior of the data. The outlier value is 42.

2. What is the purpose of a ‘test set’? Give one advantage and one disadvantage of using a large test set.
The test set is used to estimate the accuracy of the classification rules. The accuracy rate is the percentage of test set samples that are correctly classified by the model.

3. What is the difference between ‘supervised’ and ‘unsupervised’ learning? Name an unsupervised learning algorithm and give an example of how it can be used in practical applications.
Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

4. Explain the difference between ‘nominal’ and ‘continuous’ attributes. Give TWO examples of each type of attribute.
Nominal variables: a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
Continuous: ordinal variables, e.g., gold, silver, bronze.
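
As an illustration of the two k-Means steps listed in the answer to question 4a (2005 paper), here is a minimal Python sketch. It is not taken from the exam material: the function names (`assign_points`, `update_centroids`, `k_means`), the use of Euclidean distance, the random choice of initial centroids, and the decision to leave an empty cluster's centroid where it is are all assumptions made for the example.

```python
import math
import random


def assign_points(points, centroids):
    """Step 1: label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Step 2: move each centroid to the mean of its current members."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def k_means(points, k, max_iter=100, seed=0):
    """Alternate the two steps until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = assign_points(points, centroids)
    for _ in range(max_iter):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_points(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centres, membership = k_means(data, k=2)
    print(centres, membership)
```

The sketch also makes two of the limitations from question 4b visible: `k` has to be supplied by the caller, and `update_centroids` only makes sense when a mean of the members can be computed.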