CSIS 5420 Final Exam – Answers

1. (50 Points) Data pre-processing and conditioning is one of the key factors that determine whether a data mining project will be a success. For each of the following topics, describe the effect this issue can have on our data mining session and what techniques we can use to counter this problem.
a. Noisy data
b. Missing data
c. Data normalization and scaling
d. Data type conversion
e. Attribute and instance selection

A. Noisy data
Noisy data can be thought of as "random error in data" (174) and is especially a problem in very large data sets. Noise can take many forms. First, true duplicate records, which may be splintered across a variety of disparate entries, cause a great deal of noise (153). This stems from the fact that one key piece of information may be recorded in a variety of ways: John Smith, Smith, John, J. Smith, Smith, J., J.A. Smith, J. Anthony Smith, and Jonathan Smith may all describe the same person. Second, noise may be attributable to mis-entered field information (154); serious errors occur when words are present in a number-only field. Finally, incorrect values that differ greatly from the normal range of an attribute will surely cause noise.

We can attack the data noise problem with a number of techniques. First, running basic statistics on a field can locate both incorrect values and incorrect entries; an unusually high variance or a data type error will flag some candidate noise (154). Second, we can remove noise using data smoothing, both externally and internally. It is possible to smooth data using another data mining technique such as a neural network (which performs natural smoothing in its processing) (154). It is also possible to remove or replace atypical data instances with mean values or null values (154). While this forces the analyst to deal with the problem of missing data, it prevents harm being done by noisy variables. Finally, it is possible to use off-the-shelf data cleansing tools to help correct noisy data. Such tools are invaluable for their customizable options (e.g., lists of candidate names that mean the same thing).

B. Missing data
The main difficulty inherent in missing data is figuring out whether a value is missing due to omission, due to an inability to find or enter a value, or whether the null value represents something tangible. Missing values are not necessarily a harmless proposition. A null value on something like sex would typically mean the value was not filled in (everyone must be M or F, whether it is filled in or not); however, a null value on a field like "veteran status" may be indicative of an unclear or unwanted answer, or more.

We typically handle missing data in three distinct ways so that we may continue with the process of data mining (155). First, we can ignore the missing values as if they have no weight on the outcome. Bayes classifiers are excellent at ignoring missing data; they simply build a probability value based on the complete, existing information. This, however, makes us even more reliant on the quality of the rest of the data. Second, we may try to compare and classify instances with missing values against existing complete entries. That is, if a complete and an incomplete instance match on everything except the missing field, then they are considered similar. This is very "dangerous with very noisy data in that dissimilar instances may appear to be very much alike" (155). Third, we may take the pessimistic approach of treating incomplete instances as entirely different from all other complete and incomplete instances. In this fashion we make sure that misclassifications do not occur.

Your comment about null values is quite true. This is the reason that many data modelers will often create entry codes to represent 'data unknown' or 'other', so that a null value can only mean that the value was not entered and is truly missing.
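To make parts (a) and (b) concrete, the following is a minimal sketch, assuming a hypothetical pandas table with an invented "income" column, of the statistics-based noise screening and mean/null replacement described above; it is only one of many ways these steps could be implemented.

    import pandas as pd

    # Hypothetical customer table (values invented for illustration);
    # 'income' contains one wildly mis-entered value and one missing value.
    df = pd.DataFrame({"income": [42_000, 51_000, 48_000, 9_999_999, None, 45_000]})

    # (a) Noisy data: basic statistics on the field flag values far outside its normal range.
    median = df["income"].median()
    iqr = df["income"].quantile(0.75) - df["income"].quantile(0.25)
    is_outlier = (df["income"] - median).abs() > 3 * iqr

    # Replace the atypical instances with nulls, trading a noise problem for a missing-data one (154).
    df.loc[is_outlier, "income"] = None

    # (b) Missing data: one option is simple mean imputation from the remaining clean values;
    # another is to leave the nulls alone and let a technique such as a Bayes classifier ignore them.
    df["income"] = df["income"].fillna(df["income"].mean())
    print(df)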
C. Data normalization and scaling
Many data mining techniques require that data be transformed or standardized into a more palatable format for input into the technique (neural networks, for example). There are several reasons for normalizing. First, we may want to make the data and calculations more manageable for the technique; this may involve scaling data into a computationally easier format. Second, we may want to make the data more meaningful. By standardizing input data using statistical measures (such as the mean and standard deviation) we are better able to compare seemingly disparate data against a normed standard.

Most of the techniques for data normalization and scaling fall under basic algebra and statistics (156). First, decimal scaling attempts to map a larger real-valued range onto a more manageable interval, such as (0,1) or (-1,1) (156). Not only does this make calculations easier, it also prepares the data for entry into techniques such as neural networks. Second, "min-max normalization" attempts to frame the data points in the context of the demonstrated range of values; that is, if data exists only between 10 and 20, we are not concerned with values outside that interval. Third, we can normalize using statistical distributions such as the standard normal distribution. Finally, we can use a logarithmic transformation to flatten out disparate values or make categorical data more meaningful. (A brief illustrative sketch of these transformations appears after part E below.)

D. Data type conversion
Data type conversions can be an extraordinarily tricky proposition. First, when working on a data mining project you may encounter legacy (i.e., non-current) data formats that do not mesh easily with one another. In transforming data to what is considered a normal standard we must ensure that no detail or granularity is lost in the process. Second, in converting the data we must ensure that the original data is not destroyed. If we transformed an all-categorical data set into numerical data for entry into a neural network, it would be foolish to discard the original data formats, or the transformations; just because we begin working with one data representation does not mean we will stay with it. Techniques to counter such conversion problems begin with an explicit statement of how data will be transformed and what standard it should be transformed into. By having an agreed-upon goal, and a written plan of how to attain it, there is at least a precedent for how data transformations should occur.

E. Attribute and instance selection
Attribute and instance selection will certainly dictate the quality of the model and the results you obtain from your data mining technique. We cannot enter every instance and every attribute into a training or test set, due to time and validity constraints, so we must make choices about what is important to use. First, by recognizing which technique we are using (and its corresponding requirements) we can get a better idea of what transformations and attributes will be necessary. Second, we can whittle down the number of attributes based on what our desired output is and which variables are of no interest to us. It is important not to dismiss candidate input variables that might illuminate certain relationships in the data, so this must be done with caution.

We can address these problems with a number of techniques; however, the best way to learn how to deal with these issues may simply be experience. In genetic algorithms and neural networks it is possible to pit attributes against other attributes in a survival-of-the-fittest contest using the goodness functions given. Once we have a filtered, smaller set, we can then run the remaining candidate attributes through a supervised technique (such as a decision tree) to see how good they truly are. Also, we can reduce the number of attributes by combining two or more attributes into a ratio or equation. Finally, in dealing with attributes and instances we must again stress that the instances (and attributes) present in the training data must truly be representative of the whole; otherwise, we risk modeling something entirely different from our goal.
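As referenced in part C, here is a minimal sketch, using a small invented list of values, of decimal scaling, min-max normalization, z-score standardization, and a logarithmic transformation; the exact formulas vary by text, so treat this as illustrative rather than definitive.

    import numpy as np

    x = np.array([12.0, 15.0, 18.0, 20.0, 10.0])  # invented attribute values

    # Decimal scaling: divide by a power of ten so values land in a small interval such as (-1, 1).
    decimal_scaled = x / 10 ** np.ceil(np.log10(np.abs(x).max()))

    # Min-max normalization: frame each point within the demonstrated range of values.
    min_max = (x - x.min()) / (x.max() - x.min())

    # Z-score standardization: express each point relative to the mean and standard deviation.
    z_score = (x - x.mean()) / x.std()

    # Logarithmic transformation: flatten out widely disparate magnitudes.
    log_scaled = np.log10(x)

    print(decimal_scaled, min_max, z_score, log_scaled, sep="\n")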
2. (80 points) We've covered several data mining techniques in this course. For each technique identified below, describe the technique, identify which problems it is best suited for, identify which problems it has difficulties with, and describe any issues or limitations of the technique.
a. Decision trees
b. Association rules
c. K-Means algorithm
d. Linear regression
e. Logistic regression
f. Bayes classifier
g. Neural networks
h. Genetic algorithms

For each of the following eight data mining techniques I will summarize the technique (1), identify which problems it is best suited for (2), identify which problems it has difficulties with (3), and describe any issues or limitations of the technique (4):

A. Decision trees
1. Decision trees are simple structures "where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes" (9). They are a very popular supervised learning technique involving training. A set of instances provides a subset of "training data" to help the model learn how data should be classified; additional test data instances are then categorized using the model (68). If the instances are classified correctly, we are done; if they are not, the tree is altered until the test data set is classified correctly or is exhausted (68).

2. Decision trees are a popular technique because they can tackle a wide array of problems (9). They can model both categorical (i.e., discrete) data and continuous data by creating branches based upon ranges of values (WWW1). Moreover, they are very easy to understand and apply to solving real-world problems (78, WWW2). Problems that need some formal display of logic, rules, or classification are good candidates for decision trees. Decision trees are easily converted into sets of production rules (43), which show basic probabilistic quantifications. They are also good at indicating which fields are most significant in differentiating groups: those nearest the root node of the tree (WWW1).

3. While decision trees may overall be the easiest of the data mining techniques, they are not ideal for solving every problem. The output attribute of a tree must be strictly categorical and single-valued (78). The choice of training data may cause slight to major instabilities in the algorithm's choice of branching (78). Moreover, decision trees may become complex and less intuitive as the number of input variables increases dramatically (78). Increased complexity may lead to unreasonable time, funds, and resources spent on training (WWW1).

4. As noted in point 3, the output attributes of decision trees are limited to single-valued, categorical attributes (78). While a tree is easy to visualize on a small scale, when data becomes more complex or is based on a time dimension (i.e., a time series) the output and the resulting tree might not benefit the problem (WWW1). In many fields, such as finance and medicine, where continuous output variables are frequent, another method may be better (WWW1).
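As a small illustration of the supervised training and rule conversion described above, the following sketch fits a scikit-learn decision tree to an invented two-attribute data set and prints the induced rules; scikit-learn is just one of many possible tools here, and the data is hypothetical.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented training data: [age, income] -> whether a customer responded to a promotion.
    X_train = [[25, 30_000], [40, 60_000], [35, 52_000], [23, 28_000], [50, 80_000], [45, 75_000]]
    y_train = ["no", "yes", "yes", "no", "yes", "yes"]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X_train, y_train)

    # Trees convert naturally into production-rule form, one of their main attractions.
    print(export_text(tree, feature_names=["age", "income"]))

    # Classify held-out test instances with the trained model.
    print(tree.predict([[30, 45_000], [55, 90_000]]))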
B. Association Rules
1. Association rules are a mining technique "used to discover interesting associations between attributes contained in a database" (49). They are unique in that a single variable may be considered as an input and/or an output attribute (78). The most popular use of association rules is in the technique also known as market basket analysis (78). The main ways of gauging association rules are (101-102):
--confidence (the conditional probability that B is true when A is true in an A->B statement)
--support ("minimum percentage of instances in the database that contain all items listed in a given association rule")
--accuracy (how often A->B really holds)
--coverage (what percentage of things having characteristic B also have characteristic A)

2. Association rules are frequently used in problems with several desired, multi-valued outputs (49). Problems requiring clear and understandable results with practical answers are good candidates for this technique (WWW2). Association rules are a good method for problems where unsupervised mining is required (WWW1). Depending on the size and complexity of the data set, one can literally search out all possible combinations of input and output attributes. Also, if the data set tends to contain records of "variable lengths," this technique might be a good fit (WWW1).

3. The association rules technique is a poor choice on extremely large data sets due to the combinatorially taxing process of comparing the various inputs and outputs (WWW1); in that case, rules by the method of exhaustion are not feasible. Problems with many data attributes are also poor candidates (WWW2), as are problems where "rare" or outlier-like occurrences are of great importance (49). Association rules typically have cutoff levels for their basic measures, and while rare occurrences may fall under such a threshold they are nonetheless important; they may be discarded simply because they do not meet the minimum criteria.

4. The power of association rules may be in the eye of the beholder. Requirements for minimum coverage, support, and confidence can all be set based on the tolerance or desire of an individual data miner. Thus, the ability of the technique to shed light on a problem may indeed be in the hands of the experimenter. Moreover, even precisely quantified rule measures may be of little value when applied to the practical world. Knowing that there is an 80% chance that someone who buys beer on a Sunday will also buy potato chips may be trivial to a company or individual; it is the odd occurrences (e.g., buying beer and diapers) that are of great value. Additionally, one may want to use experience and other techniques to broaden the data set and its investigation to determine whether events are causal, correlative, or due to randomness, which may or may not be important! When they work, association rules are also easy to explain to the business.
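The rule measures listed in B.1 are easy to compute directly; the sketch below, using invented market baskets, shows support and confidence for a hypothetical beer -> chips rule.

    # Invented market baskets; each set is one customer transaction.
    baskets = [
        {"beer", "chips", "salsa"},
        {"beer", "diapers"},
        {"chips", "salsa"},
        {"beer", "chips"},
        {"milk", "bread"},
    ]

    def support(itemset):
        """Fraction of all baskets that contain every item in the itemset."""
        return sum(itemset <= b for b in baskets) / len(baskets)

    def confidence(antecedent, consequent):
        """Conditional probability of the consequent given the antecedent (A -> B)."""
        return support(antecedent | consequent) / support(antecedent)

    rule = ({"beer"}, {"chips"})
    print("support:", support(rule[0] | rule[1]))   # beer and chips together: 2/5
    print("confidence:", confidence(*rule))         # of beer buyers, 2/3 also buy chips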
C. K-Means Algorithm
1. The K-means algorithm is a popular means of performing unsupervised clustering when not much may be known about the data set. The iterative algorithm can be broken down into five steps (84):
--Choose the number of clusters to be formed (K).
--Choose K random points from the set of N points to be used as the initial centers of the K clusters.
--Use the Euclidean distance formula to assign the remaining N-K points to the nearest center.
--Calculate new cluster centers based on the mean of the points in each cluster.
--If the new centers equal the old centers, we are done; if not, repeat steps 3-5.

2. The K-means algorithm is very helpful in cases where little is known about the data set except for a specific goal: the number of clusters. This kind of clustering can be performed on a wide array of real-valued data types (WWW1). The computation process can also be adapted to problems that require a "good enough" solution by using cutoff values for closeness or for the number of iterations. It works well using brute force to determine cluster centers.

3. The algorithm has problems with categorical data (88). If categorical data needs to be clustered in this fashion, it requires good data transformations to make the data real-valued and consistent. If the miner has no idea how many clusters are desired, the algorithm may produce useless results (88): too high an estimate will create singleton-like clusters, while too low an estimate will create large clusters that are grouped without practical meaning. Finally, problems that require an optimal solution may not be good candidates for this technique (87). While the clusters are guaranteed to form stably, the optimization may be local, or merely good enough, rather than globally optimal (87). Problems with a great many data points, or with large "n-tuples," may be computationally taxing and require a large amount of time and resources to reach a stable solution.

4. The K-means algorithm, for all its positives, is a somewhat clumsy technique. Many more adept clustering algorithms obtain clusters faster, or with better initial center choices. Another limitation of this technique is that while the clusters are stable, they are not consistent: the choice of initial centers greatly influences the final clustering solution (233). Since the K-means algorithm does a poor job of providing an explanation for its results, it is often difficult to determine which variables are relevant or irrelevant to the solution (89). This can be remedied by testing the unsupervised clusters with a supervised data mining technique (such as decision trees or association rules) to gauge the "goodness" of the model.
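The five steps listed in C.1 translate almost directly into code; the following is a minimal NumPy sketch on an invented 20-point data set (in practice a library routine such as scikit-learn's KMeans would usually be used instead).

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.random((20, 2))     # invented 2-D data set of N = 20 points
    k = 3                            # step 1: choose the number of clusters

    # Step 2: pick K of the points at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    while True:
        # Step 3: assign every point to the nearest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: recompute each center as the mean of the points assigned to it.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])

        # Step 5: stop once the centers no longer move; otherwise repeat steps 3-5.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    print(centers, labels)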
D. Linear Regression
1. Linear regression is a popular and powerful statistical technique that relates one or more input variables to a dependent variable. The general and simple linear equations are (292):

f(x1, x2, ..., xn) = a1x1 + a2x2 + ... + anxn + c    and    y = ax + b

where the x's are independent variables, the a's are coefficients to be determined, b is the y-intercept, and c is a constant. The mathematics and linear algebra that support such models are powerful, relatively easy to understand, and applicable in a wide array of cases. Further, many common statistics and measures can be used with confidence to gauge the goodness of such a model.

2. Linear regression naturally works quite well on distributions where the implicit or explicit relationships can be approximated linearly. It works well on problems where the data is in real-valued, numerical form, and data with continuous ranges also works quite well in linear regression (298).

3. Linear regression tends not to work well on problems with curvilinear, polynomial, logistic, or scattered tendencies (i.e., most problems using real data; simplifying assumptions are often made to support the use of linear regression) (298-299). The technique also tends to work less effectively as the number of independent variables grows, because adding more "less important" variables to the model drags down its predictive performance. Another group of problems that linear regression has difficulty with is categorical-variable problems. Categorical variables, especially binary variables, cannot be properly captured by a model that emphasizes real-valued coefficients and estimates (299).

4. One limitation of the technique, dealing with non-linear data, can be mitigated by using weighted values or transformations of the input variables (WWW3). In this fashion, while the input variables may be adjusted, the linear model is still one of many valid ways to get a handle on the data. One issue with measuring the sample correlation coefficient, r, is that a value between -1 and +1 may not be sufficient to determine a relationship in a larger data set; while an r-value might not be near 1, it may still be significant once the model and a graphical representation of it are evaluated.

Very Good

E. Logistic regression
1. Logistic regression is an extension of the linear regression technique. It is defined by the equation (300):

p(y = 1 | x) = e^(ax+c) / (1 + e^(ax+c))

where ax + c is the linear form from the first equation in D.1. This process still models a dependent variable based upon one or more independent variables; however, it allows a wider array of problems to be solved and reduces the sum of squared errors (300).

2. Logistic regression works very well on problems that simple linear regression cannot normally handle: categorical or binary problems (299). As seen earlier in the class, the equation in E.1 leads to a cumulative probability curve in which, as x increases, the value of p gets asymptotically close to 1.0 (301); a similar trend holds as x decreases and p approaches 0.0. Thus, if a variable can only take on 2, ..., n values, this scale gives an intuitive feel for the probability. In fact, each coefficient can be thought of as a measure of probability for its input variable, which leads to very powerful results.

3. Logistic regression may not be the best choice for problems whose input variables are mostly continuous; the time spent on transformation could be better used on goodness-of-fit calculations. Along those lines, logistic regression does not have as many goodness-of-fit measures (WWW4). Also, logistic work may be intimidating to some novice or superficial analysts dealing with data mining.

4. One limitation of logistic regression is that it works better over iterative attempts. Once a first trial model is created, one may be better off limiting the input variables to a select few that have proven to be probabilistically significant (i.e., have high logistic coefficients).
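A compact sketch of both regression techniques, fitting the equations in D.1 and E.1 to invented one-variable data with scikit-learn (one of many possible tools):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    x = np.arange(1, 11).reshape(-1, 1)            # invented single independent variable

    # Linear regression: fit y = ax + b to a numeric, roughly linear response.
    y_numeric = 3 * x.ravel() + 5 + np.random.default_rng(0).normal(0, 1, 10)
    linear = LinearRegression().fit(x, y_numeric)
    print("a =", linear.coef_[0], "b =", linear.intercept_)

    # Logistic regression: fit p(y=1|x) = e^(ax+c) / (1 + e^(ax+c)) to a binary response.
    y_binary = (x.ravel() > 5).astype(int)
    logistic = LogisticRegression().fit(x, y_binary)
    print("p(y=1 | x=7) =", logistic.predict_proba([[7]])[0, 1])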
F. Bayes Classifier
1. This technique is a way to classify things in a supervised manner based upon one thing: conditional probability. In statistics, Bayes' theorem is stated as (302):

P(H | E) = P(E | H) * P(H) / P(E)

This equation yields the probability that a hypothesis is true given evidence that we know is true. Bayesian probability can easily be calculated as long as we know (or have reliable estimates of) the hypothesis, evidence, and conditional probabilities.

2. Bayesian techniques work very well on small to moderately sized, countable data sets where the probability calculations are based on work already done. The technique also works surprisingly well on a number of common, problematic issues. First, it accounts well for zero-valued attribute counts (305): if an attribute count is zero, the Bayesian equations add a small nonzero constant to ensure nonzero answers, so a zero in a field still has some probability associated with it. Second, the technique deals similarly well with problems containing missing data. If an instance with multiple input values has a null field, the theorem above simply discards that variable and creates a probability based on all the non-trivial input variables (306).

3. This technique does not work well with problems where multiple input variables have some degree of dependence on one another (302); such a situation would cause the conditional probabilities to be falsely elevated. Another type of problem that Bayesian techniques do not handle well is a data set with unequally weighted input attributes (302). Bayesian probability assumes that events have the same, equal chance of occurring; if this is not true, the final probabilities will be in error.

4. One point of interest is that Bayesian classification can be performed on numerical data using a slightly altered version of the normal probability distribution. All we truly need to know, in this case, is the class mean, the standard deviation, and the value of the numerical attribute in question (307). However, we must note that this approach also assumes an approximately normal distribution (especially as the number of instances increases).
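A worked instance of the theorem above, with invented probabilities, plus the numeric variant from point 4 as implemented by scikit-learn's GaussianNB (the scenario and numbers are hypothetical):

    # H = "customer churns", E = "customer filed a complaint".
    p_h = 0.10          # P(H): prior probability of churn
    p_e_given_h = 0.60  # P(E|H): probability a churner filed a complaint
    p_e = 0.15          # P(E): overall probability of a complaint

    p_h_given_e = p_e_given_h * p_h / p_e
    print(f"P(churn | complaint) = {p_h_given_e:.2f}")   # 0.40

    # Numeric attributes via class means and standard deviations (point 4):
    from sklearn.naive_bayes import GaussianNB
    X = [[25, 1], [30, 0], [45, 3], [50, 4], [35, 2]]    # invented [age, complaints]
    y = [0, 0, 1, 1, 1]                                  # invented churn labels
    print(GaussianNB().fit(X, y).predict_proba([[40, 2]]))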
G. Neural networks
1. Neural networks are a "mathematical model that attempts to mimic the human brain" (246). This data mining technique is built upon interconnected layers of nodes which allow input data to be transformed into a final output answer by feeding the information forward (246). The layers typically consist of a larger input layer followed by one or more hidden layers and a final one-node output layer. Random initial weights are assigned, and the values are literally fed forward through a series of calculations. Since the input and output data take the form of real-valued numbers in [0,1], it is relatively easy to see what the output is truly telling us.

An interesting aside: neural networks were developed to model physical neural activity and to better understand the mechanics of the biology. ANN purists frown on the use of neural networks as data analysis tools. I side with the broader interpretation of ANN use; I believe that since they model the categorization and classification methods we use to learn, they validly mimic the conclusions we could make with our wetware (brains).

2. Neural networks can be used on a variety of problems because of their ability to do both unsupervised and supervised learning. One can certainly use neural networks on "datasets containing large amounts of noisy input data" (256), which is a large advantage on "practical" problems where even the best data sets may be incomplete. The technique is also good at dealing with categorical or numerical data (WWW1). Neural networks are applicable in a wide range of business, scientific, economic, and more liberal domains (WWW2). Finally, their practicality on problems involving time series makes them popular, due to their ability to account for the dimension of time (256).

3. Despite all their positives, neural networks do have difficulty with some types of problems. They require transforming data onto the real-valued [0,1] number line, which can be difficult to do meaningfully (WWW2). Also, neural networks may converge to a solution that is less than optimal (257). Further, solutions to some problems with limited data may perform very well under training circumstances but fail on test data due to "overtraining" (257). Finally, problems that require easily explainable results may not be the best candidates for a neural network (WWW1); neural networks may produce results, even consistently, that do not have much significance without other functions to gauge them.

4. Since explanation is a huge issue for neural networks, there are a number of techniques for gauging their solutions. First, one can use sensitivity analysis, a process of making slight changes to the inputs to see how the network responds, in order to better construct the network (255). Second, one may use the average member technique to ensure that the data set is well represented in the training data (255). Finally, one can use backpropagation, which propagates the output error backward through the network to adjust its weights.
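A minimal sketch of the feed-forward setup described above, assuming invented two-attribute data scaled onto [0,1] and using scikit-learn's MLPClassifier (randomly initialized weights, adjusted during training by backpropagation-style fitting):

    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler

    # Invented inputs scaled onto [0, 1], as the text above requires, with binary targets.
    X_raw = [[2, 40], [8, 120], [3, 55], [9, 150], [1, 30], [7, 110]]
    y = [0, 1, 0, 1, 0, 1]

    X = MinMaxScaler().fit_transform(X_raw)

    # One small hidden layer between the input layer and the single output.
    net = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs", max_iter=2000, random_state=0)
    net.fit(X, y)
    print(net.predict(X))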
H. Genetic Algorithms
1. Genetic algorithms are a data mining technique based on Charles Darwin's theory of evolution (101). The technique is quite popular due to its ability to form novel, unique solutions through supervised and unsupervised applications. The basic algorithm can be described as follows (90):
--Transform the desired elements into a "chromosome" string, usually coded in binary.
--Create an initial population of possible solution chromosomes.
--Create a "fitness function" to measure the quality of the chromosomes.
--Use the fitness function to either keep chromosomes intact, or reject them and create new strings using one of three genetic operators:
1. Crossover: creating new elements by copying/combining bits of existing elements
2. Mutation: randomly flipping a 0/1 in a chromosome to introduce novelty
3. Selection: replacing unfit chromosomes with entire copies of good chromosomes

2. Genetic algorithms work well on problems that require explainable results (WWW2). They can handle a vast array of data (with some transformation) and can be used in a supervised or unsupervised manner (WWW2). Because of this natural ability to work with other techniques, genetic algorithms mesh well with neural network problems (98). They also work well on practical problems that other methods find very time-consuming and difficult, like scheduling and optimization (98).

3. To begin, genetic algorithms do not work well on problems where a great deal of transformation is required to get the data into a state suitable for chromosome-based learning (WWW2). The technique is guaranteed to find optimized solutions; however, the optima may be local rather than global (98). Problems that require quick answers based on readily available techniques may not be the best candidates for this method; training and explaining how to build and use these algorithms may be too costly (in terms of time and money) for a novice analyst (WWW2).

4. The main limitation of genetic algorithms is that their performance and results are only as good as their fitness function (98). To benefit the user, a fitness function must help show what was originally intended and must have a significant amount of practical value associated with it. Another difficulty in the iterative process is deciding which genetic operator (crossover, mutation, selection) to use, when, and how often. While mutation finds novel solutions that might not appear otherwise, the process is computationally draining and leads to diminished results, since most of the trials will be in error. Moreover, overusing an operator like selection may lead to a model appearing viable when in fact a certain family of chromosomes is dominating the others in the solution.

A variation of genetic algorithms that is worth considering is evolutionary algorithms (EAs). Where a GA can be thought of as modeling a population of individual genomes to develop the fittest solution (a being), EAs model a population of solutions (individuals) and develop the fittest solution from this population. EAs work well on problems where it may be difficult or undesirable to express our model as a binary string; the members of an EA population are represented as the set of coefficients of our fitness equation. Think about how much easier it would be to solve a traveling salesman problem as an EA rather than as a GA.
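To illustrate the operators listed in H.1, here is a minimal sketch on an invented toy problem (evolving a 16-bit string of all 1s, with bit count as the fitness function); real applications would use a domain-specific encoding and fitness measure.

    import random

    random.seed(0)
    LENGTH, POP_SIZE, GENERATIONS = 16, 20, 40

    def fitness(chrom):
        """Fitness function: count of 1 bits (the 'goodness' measure for this toy problem)."""
        return sum(chrom)

    def crossover(a, b):
        """Crossover: combine bits of two existing chromosomes at a random cut point."""
        cut = random.randrange(1, LENGTH)
        return a[:cut] + b[cut:]

    def mutate(chrom, rate=0.05):
        """Mutation: randomly flip a 0/1 to introduce novelty."""
        return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

    population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

    for _ in range(GENERATIONS):
        # Selection: keep the fitter half of the population intact.
        population.sort(key=fitness, reverse=True)
        parents = population[: POP_SIZE // 2]
        # Refill the population with mutated crossovers of the survivors.
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children

    print("best fitness:", max(fitness(c) for c in population), "out of", LENGTH)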