Monica Nusskern
CSIS 5420 – Data Mining
Week 9 Final Exam
July 26, 2005

Total Score 130 out of 130

Score 50 out of 50

1. (50 Points) Data pre-processing and conditioning is one of the key factors that determine whether a data mining project will be a success. For each of the following topics, describe the effect this issue can have on our data mining session and what techniques we can use to counter this problem.

a. Noisy data – When mining a database, many attribute values can be erroneous; human error is one of the many reasons this occurs. Training set values can be changed from exactly what they should have been, which can lead to database fields that contradict specified database rules. Noisy values may cause a system either to ignore the data or to take these values into account and possibly change them into what it believes is the correct pattern. A major issue is that noisy values are hard to find within a large database. Techniques that can be used to counter noisy data include data preprocessing, regression (smoothing data values), removing outliers, and combining computer and human inspection. Good

b. Missing data – Data is not always available when mining a large database. Missing data can have several causes:
- Data was not captured due to equipment malfunction.
- Data was inconsistent with other recorded data, so an application program may have deleted it.
- Data was not entered due to a misunderstanding.
- Certain data may not have been considered important at the time of entry.
Good

Missing data values then need to be inferred or estimated. Techniques to assist with eliminating missing data include:
Ignore the Missing Data: This may be easy, but it is not effective when the percentage of missing values per attribute varies considerably. Good
Fill in the Missing Value Manually: This can be tedious and infeasible.
Good
Use a Constant to Fill in the Missing Values: For example, use "unknown" as the value for all missing data. Good
Use the Attribute Mean to Fill in the Missing Value: The mean can be used if the attribute is numeric, or the majority value if the attribute is categorical. Good
Use the Attribute Mean for All Samples Belonging to the Same Class to Fill in the Missing Value: This is often an effective means of filling in missing values. Good
Use the Most Probable Value to Fill in the Missing Value: Examples include the Bayesian formula or a decision tree. Good

c. Data normalization and scaling – Data normalization is a frequent data transformation methodology that converts numeric data values so that they all fall within a particular range. Neural networks work well when the numerical data lies between 0 and 1. The method is also effective with distance-based classifiers, because fields with a wide array of values are much less likely to overshadow attributes with smaller ranges. Good

Decimal scaling divides each numerical value by the same power of 10. For example, if all data values fall between –100 and 100, this range can be transformed to values between –1 and 1 by dividing each value by 100. Good

d. Data type conversion – Categorical data cannot be processed by a variety of data mining tools, such as neural networks, so converting categorical (alpha) data to a numeric equivalent is a frequent data transformation methodology. Good

Conversely, some techniques cannot use numeric data directly; decision trees, for example, sort the numeric values and then determine split values for the data. Good

e. Attribute and instance selection – Some data mining techniques are not able to evaluate data with too few attributes, and others cannot assess data with a large number of attributes. Determining which attributes are relevant or irrelevant is quite a big problem in data mining.
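As an illustration of the transformations described in part (c), the normalization and decimal-scaling steps can be sketched in Python; the function names and sample values below are hypothetical, not part of the exam answer.

```python
# Minimal sketch: map values onto [0, 1] (min-max normalization) and
# divide by a common power of 10 (decimal scaling). Sample data is made up.

def min_max_normalize(values):
    """Linearly rescale values so they all fall between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def decimal_scale(values):
    """Divide every value by the same power of 10 so all fall in [-1, 1]."""
    k = 0
    while max(abs(v) for v in values) / (10 ** k) > 1:
        k += 1
    return [v / 10 ** k for v in values]

print(min_max_normalize([20_000, 35_000, 50_000, 80_000]))  # [0.0, 0.25, 0.5, 1.0]
print(decimal_scale([-100, 42, 100]))                       # [-1.0, 0.42, 1.0]
```

Rescaling this way keeps a wide-ranged field such as income from overshadowing narrow-ranged attributes in a distance-based classifier.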
The number of irrelevant attributes affects the number of training instances necessary to construct a supervised learner model. Eliminating attributes works well with neural networks and nearest neighbor classifiers; however, attribute selection has to occur before the data mining process can begin. Attributes with little predictive power may be combined with other attributes to create new attributes with a high degree of predictive capability. Good

Means of completing this task include:
Creating a new attribute where each value represents the ratio of the value of one attribute divided by the value of a second attribute. Good
Creating a new attribute where values are the differences between the values of two existing attributes. Good
Creating a new attribute where values are computed as the percent decrease or increase of two or more current values. Good

Score 80 out of 80

2. (80 points) We've covered several data mining techniques in this course. For each technique identified below, describe the technique, identify which problems it is best suited for, identify which problems it has difficulties with, and describe any issues or limitations of the technique.

a. Decision trees – A common algorithm for constructing a decision tree selects a subset of instances from the training data and builds an initial tree. The accuracy of this tree is then tested against the remaining training instances; incorrectly classified instances are added to the current set of training data, and the process repeats. Decision trees are fairly easy to comprehend and can be mapped to a distinct set of production rules that can easily be applied to real-world issues. Decision trees do not make prior assumptions about the data, and they are able to work with both numerical and categorical data.
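The derived-attribute constructions listed in part (e) of question 1 (ratios, differences, and percent changes) can be sketched as follows; the field names and figures are hypothetical:

```python
# Sketch: build new, potentially more predictive attributes from existing ones.
record = {"debt": 5_000.0, "income_2004": 40_000.0, "income_2005": 50_000.0}

# Ratio of one attribute's value to a second attribute's value.
record["debt_to_income"] = record["debt"] / record["income_2005"]

# Difference between the values of two existing attributes.
record["income_change"] = record["income_2005"] - record["income_2004"]

# Percent increase (or decrease) computed from two current values.
record["income_pct_change"] = (
    100.0 * (record["income_2005"] - record["income_2004"]) / record["income_2004"]
)

print(record["debt_to_income"])     # 0.1
print(record["income_change"])      # 10000.0
print(record["income_pct_change"])  # 25.0
```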
Some issues with decision trees are that output attributes must be categorical and that multiple output attributes are not allowed. In addition, minor differences in the training data can cause different attribute selections at each choice point within the decision tree, making the tree unstable. Good

b. Association rules – Association rules assist with discovering unknown relationships in larger databases. An attribute that is a precondition in one rule may appear as a consequence in another rule, and association rules permit the consequence of a rule to contain one or more values. Rule confidence and support are utilized to assist in uncovering associations that are likely to be interesting, for example to marketing departments. A major issue with association rules is that some discovered relationships may turn out to be trivial, so a person must interpret the results carefully. Good

c. K-Means algorithm – The K-Means algorithm is among the most popular clustering techniques. It is a statistical unsupervised clustering mechanism that uses only numeric input attributes, and a person must decide in advance how many clusters are to be discovered. The algorithm starts by randomly choosing one unique data point to represent each cluster; each data instance is then placed in the cluster to which it is most similar, new cluster centers are computed, and the process repeats until the cluster centers no longer change. The K-Means algorithm does not assure convergence on an optimal solution, it lacks the ability to explain what has been found, and it is not able to determine which attributes are significant in forming the clusters. Good

d. Linear regression – Linear regression can be an accurate method for prediction and estimation problems. It models variation in the dependent variable as a linear combination of one or more independent variables.
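The K-Means loop described in part (c) can be sketched in one dimension; the data points, the choice of two clusters, and the initial centers below are hypothetical choices for illustration:

```python
# Minimal 1-D K-Means sketch: assign each instance to the nearest center,
# recompute each center as its cluster's mean, and repeat until the
# centers stop changing.

def k_means_1d(data, centers):
    while True:
        # Place each instance in the cluster whose center it is closest to.
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # New center = mean of the cluster (keep old center if cluster empty).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # convergence: centers no longer change
            return centers, clusters
        centers = new_centers

centers, clusters = k_means_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 12.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

Because the initial centers are chosen arbitrarily, different starting points can converge to different (possibly non-optimal) clusterings, which is the convergence caveat noted above.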
Linear regression is an effective data mining technique when the relationship between the dependent and independent variables is linear, and less so when the outcome is binary in nature. This is because the equation places no restriction on the value of the dependent variable, so predicted values can fall outside the range a binary outcome allows. Good

e. Logistic regression – Logistic regression may be considered the complement of linear regression in that it is often an excellent choice for problems having a binary result. Logistic regression is a nonlinear regression methodology that associates a conditional probability value with each data instance. It allows output values to represent a probability of class membership by forcing them to lie between 0 and 1, something linear regression cannot guarantee. Good

f. Bayes classifier – The Bayes classifier is a supervised classification methodology in which all input attributes are assumed to be of equal importance and independent of one another. The method can still work effectively even when these assumptions are known to be false. A benefit of Bayes classifiers is that they may be applied to datasets that contain both categorical and numeric data, as well as datasets that are missing information. Good

g. Neural networks – Neural networks are a group of interconnected nodes designed to imitate the function of the human brain. A unique characteristic is that they can be utilized for both supervised and unsupervised clustering; however, the input values must always be numeric. The first phase of a neural network is the learning phase, in which input values are associated with each instance that enters the network at the input layer; there is one input-layer node for each input attribute contained in the data. Using the input values and the network's connection weights, the neural network computes an output for each instance.
The output for each instance is compared with the desired network output. Training is complete when a specified number of iterations has been performed or when the network meets a predetermined minimum error rate. In the second phase, the network weights are fixed, and the network uses them to compute output values for new instances. Good

h. Genetic algorithms – Genetic algorithms apply the theory of evolution to inductive learning. They can be used as a supervised or an unsupervised technique and are usually applied to problems that cannot be solved with more traditional methods. A fitness function is applied to a set of data elements to determine which elements will survive from one generation to the next; new instances are then created to replace the elements that do not survive. Good

Very good overview!
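As a closing illustration, the survival-and-replacement cycle described for genetic algorithms in part (h) can be sketched as below; the bit-string encoding, the count-the-ones fitness function, and single-point crossover are assumptions made for this example, not part of the exam answer.

```python
# Toy genetic algorithm generation loop: score each element with a fitness
# function, keep the fittest half, and fill the vacated slots with children
# built by crossing over pairs of survivors.
import random

random.seed(0)

def fitness(bits):
    return sum(bits)  # hypothetical fitness: more 1-bits = fitter

def next_generation(population):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]       # fittest half survives
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = random.sample(survivors, 2)       # pick two surviving parents
        cut = random.randrange(1, len(a))
        children.append(a[:cut] + b[cut:])       # single-point crossover
    return survivors + children

# Evolve a small random population of 8-bit strings for ten generations.
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(6)]
for _ in range(10):
    pop = next_generation(pop)
best = max(pop, key=fitness)
print(best, fitness(best))
```

Because the fittest elements are carried over unchanged, the best fitness in the population can never decrease from one generation to the next in this sketch.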