Total Score 120 out of 120
Karen Cegelski
Mid Term Week 5
CSIS 5420
June 30, 2005

Score 20 out of 20
1. In a supervised learner model we use a training dataset and a test dataset. The training dataset is used to train and refine the model. The test dataset represents any data we will submit to the model for classification. What constraints must we require that the training dataset conform to so that any test data will be classified correctly?

Poor test set accuracy may stem from problems with the training data. If the model is built with training data that does not contain all possible domain instances, or that contains an abundance of atypical instances, it is not likely to perform well. Randomly selecting training data while ensuring that the classes it contains are distributed as they would be in the general population is the best preventive measure for ensuring test set accuracy. This procedure is known as stratification. Another useful technique involves careful examination of training instance typicality scores.
Good

Training set data must be proportionate to the test set data: models built with training data that does not represent the set of all possible domain instances can lead to misclassified test data (215).
Redundant attributes must be addressed: redundant attributes can certainly wreak havoc upon output results (228).
Missing values must be addressed: in the data preprocessing phase, decisions need to be made as to how to handle missing or null attribute values on an entry-by-entry level (155).
Input must be meaningful and error minimized: we have a variety of error checking and process simplifying mechanisms at our disposal (129).

Score 20 out of 20
2. In unsupervised clustering we adjust, or the data-mining tool adjusts, a cluster membership fitness threshold value. This value determines whether a piece of data will belong, or not belong, to a particular cluster.
Tuning this value adjusts the number of clusters a model will partition the training data into. Why is it important to tune the number of clusters that our unsupervised clustering model will create?

Without a dependent variable to guide the learning process, we must rely on the learning program to build a knowledge structure by using some measure of cluster quality to group instances into two or more classes. The primary goal of unsupervised clustering is to discover concept structure in the data.

Common uses of unsupervised clustering include:

Determine if meaningful relationships in the form of concepts can be found in the data
Good

Evaluate the likely performance of a supervised learner model
Good

Determine a best set of input attributes for supervised learning
Good

Detect outliers
Good

Varying the number of clusters can assist in detecting any atypical instances present in the data by examining those instances that do not group naturally with other instances.
Good

The main goal of unsupervised clustering is to create K clusters where the entries within each cluster are very similar, but the clusters are different from one another.

Score 20 out of 20
3. The K-Means algorithm has some issues that can cause poor or incorrect results. What are these issues? What data preprocessing can we use to either detect or avoid each issue?

This method is easy to understand and implement. However, several issues must be considered:

The algorithm only works with real-valued data. Categorical attributes in the data set must either be discarded or converted to a numerical equivalent.
Good

A value for the number of clusters formed is required. Making a poor choice is an obvious problem. Running the algorithm several times using different values for K can assist in determining how many clusters may be present in the data.
Good

The K-Means algorithm works best when all the clusters that exist in the data set are approximately the same size.
Clusters of unequal size mean that the K-Means algorithm may not find the best solution.
Good

There is no way to determine which attributes are significant in determining the formed clusters. Irrelevant attributes can cause less than optimal results.
Good

We are responsible for interpreting what has been found, because the algorithm gives no explanation of the nature of the formed clusters. Supervised data mining tools can help gain insight into the nature of the clusters formed by unsupervised clustering algorithms.
Good

Score 20 out of 20
4. Genetic algorithms are powerful tools but they have some issues that affect their results or their performance. What are these issues? What methodologies can we use to either mitigate or avoid each issue?

Genetic algorithms are gaining in popularity; however, few commercial data mining products contain this component. They can be used for both supervised and unsupervised learning and are typically used in situations where traditional techniques will not solve the problem. However, this problem solving approach has other issues you need to be aware of:

Genetic algorithms are designed to locate globally optimized solutions. There are no guarantees that any given solution is not the result of a local rather than a global optimization. Careful design of the fitness function can help steer the search toward a global optimum.
Good

The fitness function determines the computational complexity of a genetic algorithm. If the fitness function involves several calculations, it can be computationally expensive.

Operating on dynamic data sets is difficult, as early convergence towards solutions may no longer be valid for later data.
A possible remedy is to increase genetic diversity and so prevent early convergence, either by increasing the probability of mutation when the solution quality drops or by occasionally introducing new, randomly generated elements into the gene pool.
Good

GAs cannot effectively solve problems in which there is no way to judge the fitness of an answer other than right/wrong, as there is no way to converge on the solution.
Good

Transforming the data to a form suitable for a genetic algorithm can be challenging.
Good

Score 20 out of 20
5. The quality of a data mining model is highly dependent on the data used to train or develop the model. Discuss the techniques used to preprocess our data prior to data mining.

Data preprocessing begins with data cleaning. This involves accounting for noisy data and dealing with missing information. Ideally, data preprocessing would take place before the data is stored permanently in the data warehouse.
Good

Noisy data refers to random errors in the attribute values. In large datasets, noise can take many shapes and forms. The concerns include: 1) How do we find duplicate records? 2) How can we locate incorrect attribute values? 3) What data smoothing operations should be applied to the data? 4) How can we find and process outliers?
Good

Missing data items can be dealt with in various ways. Frequently, missing attribute values indicate that the information is lost. Most data mining techniques require that all attributes contain a value.
Some solutions for dealing with missing data before it is presented to the data mining algorithm are: 1) Discard records with missing data: this should only be used when a small percentage of the total number of instances contain missing data; 2) For real-valued data, replace missing values with the class mean: this is an option for numerical attributes; and 3) Replace missing attribute values with the values found within other highly similar instances: this method can be used for either categorical or numeric attributes.
Good

Some techniques allow for instances to contain missing values. Three ways that these techniques deal with the missing data are: 1) Ignore the missing values: neural networks and the Bayes classifier use this approach; 2) Treat missing values as equal comparisons: this approach can present problems with very noisy data, in that instances may appear to be alike when in reality they are not; 3) Treat missing values as unequal comparisons: in this case two similar instances containing missing values may appear dissimilar.
Good

Score 20 out of 20
6. Neural networks have three significant weaknesses. What are they? Give an example that demonstrates each weakness. How can we mitigate each weakness?

One of the biggest criticisms of neural networks is that they lack the ability to explain their behavior. In areas where explaining rules is important, such as denying loan applications, neural networks should not be the tool of choice. Neural networks should be used when acting on the results is more important than understanding them.
Good

The learning algorithms associated with neural networks are not guaranteed to converge to an optimal solution. This problem can be dealt with by manipulating various learning parameters.

The inputs to a neural network can be massaged to be in a particular range, usually between 0 and 1. Sometimes this choice can have an effect on the results.
Massaging the data requires analyzing the training set to verify the data values and their ranges. Since data quality is the number one issue in data mining, performing this step can prevent problems later in the analysis.
Good

Neural networks can be overtrained to the point where they work well on training data but poorly on test data. Test set performance must be consistently monitored in order to watch for this problem.
Good
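
The input scaling mentioned in the answer to question 6 (massaging neural network inputs into the range 0 to 1 after checking the value ranges in the training set) can be sketched as a simple min-max transformation. This is only an illustration under stated assumptions: the function names fit_min_max and apply_min_max, the sample age and income values, and the choice to clip unseen values that fall outside the training range are all hypothetical and do not come from the exam or its textbook.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs computed from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, ranges):
    """Scale each value into [0, 1] using ranges learned from the training set."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # Constant column: there is no spread to scale, so use 0.0.
                new_row.append(0.0)
            else:
                # Assumption: clip values outside the training range to [0, 1].
                new_row.append(min(1.0, max(0.0, (value - lo) / (hi - lo))))
        scaled.append(new_row)
    return scaled


# Hypothetical training data: (age, income) pairs invented for illustration.
train = [[20, 30000], [40, 50000], [60, 90000]]
ranges = fit_min_max(train)
train_scaled = apply_min_max(train, ranges)

# A later instance is scaled with the ranges learned from the training set.
test_scaled = apply_min_max([[50, 120000]], ranges)
```

Learning the ranges from the training set alone, and reusing them for later data, keeps the scaling consistent between training and classification. Whether out-of-range values should be clipped, as here, or left outside the interval is a design choice that may itself affect results.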