CSIS 5420 Mid-term Exam – Answers

1. In a supervised learner model we use a training dataset and a test dataset. The training dataset is used to train and refine the model. The test dataset represents any data we will submit to the model for classification. What constraints must we require the training dataset to conform to so that any test data will be classified correctly?

A. Training set data must be proportionate to the test set data: we have seen that "models built with training data that does not represent the set of all possible domain instances" (215) can lead to misclassified test data. Especially when training data is spotty in nature, or when it comes from a tiny proportion of the population, certain parts of the population may be misrepresented. For instance, if 25/1000 known instances are used for training there is a good chance that the few outliers that exist may be missed. Despite their small numbers these outliers have great value and should not be stricken from the training data without just cause (i.e., without addressing them another way). Ideally, the training set should be a smaller yet characteristically proportionate version of the test data.

B. Redundant attributes must be addressed: redundant attributes can certainly wreak havoc upon output results. Redundant data, both positively correlated (r = 1.0) and negatively correlated (r = -1.0), can act as a weight in a mining technique. Thus, the weighted values will skew the model and the test set data may not conform to it. Fortunately, there are a number of ways of locating, determining, and removing redundant data: correlation coefficients, scatter plot diagrams, etc. (228).

C. Missing values must be addressed: in the data preprocessing phase, decisions need to be made as to how to handle missing or null attribute values on an entry-by-entry basis. If the variables are insignificant to us (statistically or otherwise) we may choose to remove them from the training data or make them read-only.
However, for null values of important input attributes (e.g., salary in a salary survey) a 0 value would skew the model and its computations. Thus, we must decide whether to ignore the values, enter mean values, treat the entries as similar to other instances, or treat them as a special case of their own (155).

D. Input must be meaningful and error minimized: we have a variety of error-checking and process-simplifying mechanisms at our disposal. First, if input data is not meaningful for classifying the test data, we can remove it from the model. Second, we can judge error using tools like a confusion matrix to see where misclassification occurs. It may be of great importance to keep Type 2 errors (should have been rejected, but was accepted by the computer) below a threshold level, while Type 1 errors may be more tolerable. Third, we may use the 95% confidence intervals on the error to iteratively adjust the test set data to reduce the size of the interval or the mean value. We must note, however, that unless the size of the test set is over 100 instances the error may need to be taken with a grain of salt (129).

2. In unsupervised clustering we adjust, or the data mining tool adjusts, a cluster membership fitness threshold value. This value determines whether a piece of data will belong, or not belong, to a particular cluster. Tuning this value adjusts the number of clusters a model will partition the training data into. Why is it important to tune the number of clusters that our unsupervised clustering model will create?

It is incredibly important for us to responsibly tune and estimate the number of clusters, or else the data will be open to a variety of misclassifications. The main goal of unsupervised clustering is to create K clusters where the entries within each cluster are very similar, but the clusters are different from one another (WWW11).
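This goal (small distances within a cluster, large distances between clusters) can be illustrated with a short sketch. The 2-D points and their cluster assignments below are hypothetical, invented purely for illustration:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Two hypothetical, well-separated 2-D clusters
cluster_a = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
cluster_b = [(8.0, 8.0), (8.3, 7.9), (7.8, 8.2)]

def mean_pairwise(points_x, points_y=None):
    """Mean distance over all pairs, within one set or across two sets."""
    if points_y is None:
        pairs = [(p, q) for i, p in enumerate(points_x)
                 for q in points_x[i + 1:]]
    else:
        pairs = [(p, q) for p in points_x for q in points_y]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra = (mean_pairwise(cluster_a) + mean_pairwise(cluster_b)) / 2
inter = mean_pairwise(cluster_a, cluster_b)
# A good clustering shows small intra-cluster and large inter-cluster distance
print(intra < inter)  # True for this well-separated example
```

A clustering where the intra-cluster distance approached the inter-cluster distance would signal that the chosen number of clusters does not match the data's natural grouping.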
However, one of the drawbacks of unsupervised clustering is that we must supply an estimate of the number of clusters that we want our final output to conform to. With predetermined Boolean outputs (yes/no, male/female, sick/not sick) our choice is made for us. In more complex examples, especially when dealing with continuous or multi-valued categorical data, a poor estimate could be disastrous. If we choose too many clusters for another data mining technique, like the K-means algorithm, clusters will be so segmented they will lose meaning. If we choose too few, we will have very different entries grouped together. Another drawback is that programs like IDA/ESX allow you to manipulate the clustering power without a sense of the optimal number of clusters. "Instance similarity" (and to a lesser degree the "Real-valued tolerance setting") can be manipulated to obtain fewer clusters (0) or more clusters (100). If we obtain more clusters than we need, we also run the chance of producing singleton clusters (each cluster with one entry). If we obtain fewer clusters than we need, distinct characteristics will lose their value. We can, however, put more constraints on this unsupervised process. First, we may try some alternative cluster estimates or multiple tries in IDA with varying instance similarities. A common technique when using the K-means algorithm is to begin at K = 10 clusters and adjust accordingly from there. For unsupervised clustering we can begin at an instance similarity of 50 and adjust as needed after viewing basic statistics on the trials. Second, we may put limiting conditions on what passes for a cluster. Such conditions could include: minimum distance, number of passes/tries, minimum cluster size (if it is too small it is thrown out), a measure of how close points are within the cluster, a measure of how different points from different clusters are, etc. (WWW12).
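The "start at an estimate and adjust" idea can be sketched by watching the total within-cluster squared error as K varies: the error drops sharply until K reaches the natural number of groups, then flattens. Everything below (the toy K-means, the synthetic three-group data, and all parameters) is an illustrative assumption, not a procedure from the text:

```python
import random
from math import dist

def kmeans_sse(points, k, iters=20, seed=0):
    """Toy K-means; returns total within-cluster squared error (SSE)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(iters):
        # Assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        # Recompute each center as the mean of its cluster
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)

# Hypothetical data with three natural groups
random.seed(1)
points = [(random.gauss(mx, 0.3), random.gauss(my, 0.3))
          for mx, my in [(0, 0), (5, 5), (10, 0)] for _ in range(30)]

# SSE falls steeply up to the natural cluster count, then levels off
for k in (1, 2, 3, 4, 5):
    print(k, round(kmeans_sse(points, k), 1))
```

Plotting these SSE values against K and looking for the bend (the "elbow") is one common way to turn the trial-and-error tuning described above into a repeatable procedure.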
WWW11 = http://www.palantir.swarthmore.edu/loicz/help/clustering.htm
WWW12 = http://www.geo.utep.edu/pub/seeley/Classification_tutorial_multispec.html

3. The K-means algorithm has some issues that can cause poor or incorrect results. What are these issues? What data preprocessing can we use to either detect or avoid each issue?

The K-means algorithm certainly has some issues that can cause poor, incorrect, or disappointing results. A sample of such issues (and their solutions) is below:

A. The number of clusters must be estimated in Step 1 of the algorithm (88), and this is not always possible. By underestimating we create unnatural clusters that do not represent anything real; by overestimating we weaken our power to easily grasp the true relationships. This can be solved by using unsupervised clustering, or even unsupervised neural networks, to see what naturally groups together and which characteristics are strong; we can base the initial supervised clusters on this. Basic statistics also help to explain the nature of the data set.

B. The initial cluster centers are picked at random and will most likely be suboptimal. Even worse, the centers may end up being in the same cluster or far removed from the stable center. Thus, while a stable center is guaranteed, time is wasted on calculations. This can be solved by calculating pairwise distances between points to locate a handful of remote points. These may be used as initial cluster centers (regardless of the number of initial clusters).

C. The ability to try and judge alternatives is severely hampered as the number of points and pairs increases. The combinatorial explosion makes comparing every possible combination of points, and using different initial centers, virtually impossible. Thus, even the search for better suboptimal solutions is hampered due to time and resource constraints.
This can be solved by picking better centers (see B above), which will eliminate some calculations. One can also put limiting or terminating conditions on calculations of alternatives; once a condition is met, such as the minimum squared error falling below a certain threshold (88), calculation of alternatives can cease.

D. Our choice of initial cluster centers will be reflected in our final cluster centers. More so, "we may see a different final cluster configuration for each alternative choice of the initial cluster centers" (87). Thus, optimal and near-optimal answers become more elusive. Once again, this can be solved by picking better centers to begin with, using an initial Euclidean distance measurement and picking remote points as centers. Also, by understanding the data set through graphical plots, we may be able to accept that a suboptimal solution is (and to some extent, should be) acceptable.

E. Even outside of the problems of center choice and combinatorial explosion, we see that optimal solutions require alternative computations, which may be impractical. The process of calculating and recalculating centers and distances can be somewhat numbing as the set size increases. However, this can be solved by scaling large sets down to smaller, yet equally representative, sets. This allows computations and measures of goodness to be done on a more manageable set. While the solution will technically be suboptimal, it can certainly meet satisficing conditions. We must be careful to note that if this sampling is done, the sampled sets must mimic the whole set or our calculations and estimates will be for naught.

F. The K-means algorithm "only works with real valued data" (88). We have two ways of dealing with categorical data: discard it or convert it. Discarding values strips valuable data from the set, and converting values may be time consuming or may create numerical measures that do not properly represent the data.
This can be solved by running unsupervised clustering to identify the important categorical variables. Those deemed important, or interesting, can be transformed by assigning random (yet evenly spaced) values to the categorical attributes. These may be on a scale of (0,1) or done using real numbers.

G. The K-means algorithm tends to work "best when the clusters that exist in the data are of approximately equal size" (89). Unbalanced data sets may force calculated centers to be far away from any data point (in a small data set with at least one outlier) or lumped in together with recurring values (in a large data set with repetitive values). This can be solved by more than one method. First, weights can be assigned to data points to counterbalance extraordinarily large or small data sets. Second, one may be able to break large clusters into smaller (yet tightly grouped) clusters whose mean values may better approximate the larger cluster's mean.

H. Neither attribute significance nor the clusters themselves can be fully explained using the K-means algorithm. Clusters with stable centers will appear, but how an attribute or initial center choice precisely affects the final result can be hazy at best. This can be solved by using alternative methods to evaluate such choices. The use of sensitivity analysis, genetic algorithms, and numerical clustering can measure (and even mimic) the K-means clustering process. Values obtained from these processes can then be measured against the K-means clusters to help give some frame of reference for comparisons.

4. Genetic algorithms are powerful tools, but they have some issues that affect their results or their performance. What are these issues? What methodologies can we use to either mitigate or avoid each issue?

Genetic algorithms certainly have some issues that can cause poor, incorrect, or disappointing results. A sample of such issues (and their solutions) is below:

A.
They "are designed to find globally optimized solutions" (98); however, it is quite difficult to show or prove that an optimized solution is not due merely to local optimization. Ensuring that a solution is globally optimal would require a fair number of tries, each of which narrows the parameters for change (crossover, mutation, and selection). This can be solved by using some form of sensitivity analysis and adjusting the use of genetic techniques. More so, one can ensure a more globally acceptable answer by eliminating unnecessary variables prior to their introduction to the algorithm.

B. Convergence in a genetic algorithm may be premature (WWW2). In a small data set, or a data set with repetitive values, the early dominance of a chromosome (or part of one) can easily influence and skew the algorithmic process. This can be solved through the use of dampening or scaling transformations of the variables in question to create a fairer and more balanced picture.

C. The length, complexity, and resource drain due to a specific fitness function may prove costly on a number of levels (98). Due to the random nature of genetic algorithms, solutions may also arise that are incorrect (WWW1). It has been shown that error reduction and, hence, more fit solutions are accomplished via multiple runs of an algorithm along with incremental adjustments (253). Thus, we must consider the multiplicity of work needed to reduce such error. This can be solved by fashioning as elegant an equation as possible, specifically built to measure the desired piece of knowledge. More so, reduction of insignificant variables and proper coding of categorical variables in the preprocessing stage can ease the burden on a function.

D. The explanatory power of a genetic algorithm's results is directly related to how understandable they are (98). It has been shown that a good genetic algorithm must be "run for a very long time or have excellent design rules in order to develop meaningful and optimal solutions to a problem" (WWW3).
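The selection, crossover, and mutation loop discussed in these answers can be sketched on a toy fitness function. The "OneMax" problem (maximize the count of 1-bits), the tournament selection scheme, and every parameter below are illustrative assumptions, not taken from the text; the mutation rate is the knob that guards against the premature convergence noted in item B:

```python
import random

rng = random.Random(42)

def fitness(bits):
    """Toy fitness function: count of 1-bits (the 'OneMax' problem)."""
    return sum(bits)

def evolve(pop_size=30, length=20, generations=40, mutation_rate=0.02):
    # Random initial population of bit-string chromosomes
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def pick():
            # Selection: tournament of two keeps the pressure moderate
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: occasional bit flips preserve diversity
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve()
print("best fitness:", fitness(best))
```

Raising the mutation rate slows convergence but explores more of the space; lowering it speeds convergence but risks a dominant chromosome taking over early, which is exactly the art-versus-science trade-off item E describes.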
A large problem regarding fitness function answers is that their numerical values may not be inherently meaningful without context. The genetic algorithm may require some data transformation prior to entry into the algorithm. When categorical variables are transformed to a (0,1), discrete, or real-valued scale, the numerical ordering may not be meaningful. That is, if we assign 0 = Male and 1 = Female, does it mean that one is better than the other? We can solve these problems by putting the solution, and the equation, in real-world context. A good fitness function must be able to produce a practical answer. Likewise, the answer must make sense in terms of the input variables. If we get an answer of 0.25 we must know whether a higher or lower score is better, or whether closeness to a discrete or scaled value is important.

E. The use of genetic algorithm techniques (crossover, mutation, selection) may be more of an art than a science. Numerous mutation tries may yield only a handful of viable (even very viable) alternative chromosomes. Selection, especially in small or limited data sets, can cause dominant features to overrun the model. Even crossover techniques are combinatorially taxing. The rate of change is also of great concern – too fast and significant variables may be wrongly kicked out, too slow and resources will be wasted (WWW2). The solution to this problem is experience and insight. Experience will dictate how much each technique should be used, and insight into the project's limitations (time, money, resources, goal) will dictate this as well.

WWW1 = http://www.cs.bgu.ac.il/~omri/NNUGA/
WWW2 = http://www.talkorigins.org/faqs/genalg/genalg.html#limitations
WWW3 = http://uk.geocities.com/neilcarrott/intro.htm

5. The quality of a data mining model is highly dependent on the data used to train or develop the model. Discuss the techniques used to preprocess our data prior to data mining.
The quality of a data mining model is highly dependent on the data used to train or develop the model. While there are a number of concerns and techniques involving data preprocessing, the most prevalent form is data cleaning (153). We must note that data transformation techniques and data preprocessing techniques may share some common means. However, as I noted on Week 3's HW assignment (#1a): "Transformation involves more of a conversion of data from one form to another via normalization, algorithms, statistics, or moving between categorical and numerical data. Some attributes may be selected over other dependent or minor ones, but they still exist. Data cleaning, on the other hand, 'changes' the data by removing duplicates, correcting attribute errors, and dealing properly with nulls (vs. zero or intentionally empty values). If this did not occur prior to transformation there would be no way to easily correct these errors (i.e., post normalization or transformation)." Here is a sampling of the main techniques used to preprocess the data (153-155, WWW8):

A. Data cleaning: this technique is concerned with "accounting for noise and dealing with missing information" (153). Noisy data suggests that there are errors, duplicate values, powerful outliers, and inconsistently entered values in the data set. Cleaning can be accomplished using programs, histograms, regression, and human experience (WWW9). Duplicate entries in a database can be costly to a company or individual; thus, trained individuals (with the help of graphical programs) can help whittle off excess entries and reduce cost and noise (153). Incorrect attribute values manifest themselves in entries that contain "discrepancies in codes or names" (WWW9) or data types. Hence, noise would be increased by a Boolean attribute that contains sentences and short phrases, or by a "sex" field with decimal values.
Without data cleaning, even basic statistics will include these errors, which many systems can find but cannot explain. Data smoothing – which is also a data transformation technique – can also be used here to round numbers or make minor adjustments to the data (i.e., inserting mean values where there are none) (154). Data cleaning is also helpful when dealing with missing data. Missing data can greatly alter statistical measures and even go as far as to prevent proper clustering and rule creation. The three most common techniques for dealing with null values are: discard missing values (if they are few), compare entries with missing values to similar complete entries, and treat missing values as if they are unique and non-comparable.

B. Sampling: this technique involves selecting "a representative subset from a large population of data" (WWW10). Sometimes including all the data from a population in a data mining model is not feasible due to size, complexity, or statistical significance problems (WWW9). Proper sampling will ensure that the sample is truly representative of the population as a whole.

C. Feature selection: this technique involves paring the data set down to its essential characteristics. If one is sure certain variables are undesirable or not significant for the given study, they can simply be removed from training and test consideration. By removing such variables the computations become easier and there are fewer degrees of freedom to account for in statistical significance. Feature selection also makes it easier to scan the remaining attributes for additional odd values or cases, which may signal important relationships or changes within the population.

D. Mapping: this technique may also nudge over into the data transformation category. Mapping may turn time series or sequential data into a more manageable form (WWW8).
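Two of the cleaning steps above, removing duplicate entries and imputing missing values with a mean, can be sketched in a few lines. The records, field names, and the choice to key duplicates on an id field are all hypothetical, invented for illustration:

```python
from statistics import mean

# Hypothetical raw records: one duplicate and one missing salary
raw = [
    {"id": 1, "sex": "F", "salary": 52000},
    {"id": 1, "sex": "F", "salary": 52000},   # duplicate entry
    {"id": 2, "sex": "M", "salary": None},    # missing (null) value
    {"id": 3, "sex": "M", "salary": 61000},
]

# Step 1: remove duplicate entries (keyed on id here)
seen, cleaned = set(), []
for rec in raw:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        cleaned.append(dict(rec))

# Step 2: impute missing salaries with the mean of the known values,
# rather than a 0 that would skew later statistics
known = [r["salary"] for r in cleaned if r["salary"] is not None]
for r in cleaned:
    if r["salary"] is None:
        r["salary"] = mean(known)

print(len(cleaned), [r["salary"] for r in cleaned])
# 3 records remain; the null becomes the mean of 52000 and 61000
```

Mean imputation is only one of the three null-handling options listed above; treating nulls as a special category of their own would instead leave a marker value in place for the mining tool to recognize.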
WWW8 = http://www.ims.nus.edu.sg/preprints/2003-15.pdf
WWW9 = http://ir.iit.edu/~nazli/cs422/CS422-Slides/DM-Preprocessing.pdf
WWW10 = http://searchdatabase.techtarget.com/sDefinition/0,,sid13_gci810056,00.html

6. Neural networks have three significant weaknesses. What are they? Give an example that demonstrates each weakness. How can we mitigate each weakness?

The three major weaknesses of neural networks are (257):

A. They "lack the ability to explain their behavior": While neural networks are quite adept at pattern recognition, they simply are not equipped to explain their choices and output. Neural networks can learn and become more intelligent via representative training data, but they do not have the ability to make value judgments concerning whether an answer makes real-world sense. Especially as hidden layers increase, the interactions in a network may become "black box"-like and increasingly difficult to explain (WWW5). An example of this could be found in a clinical study regarding the prediction of breast cancer patients in a population. It is quite possible that a model, based on training data, could show that 80% of left-handed women who own cats and have black hair are likely to be breast cancer patients. While in practice this seems utterly ridiculous to you and me, the neural network cannot differentiate between this pattern and a more meaningful one (history of cancer in a family, hazardous workplace exposure, etc.). To mitigate these problems we can use a number of techniques. First, we must ensure the training data (in size and characteristics) is truly representative of the sample or population as a whole. If the training data contains a disproportionate percentage of the desired population demographic, then when "normal" test data is fed into the network, reality will be poorly represented. More so, if the size of the study in general is not large enough (i.e., 10 women, 4/5 of whom meet the criteria), individual characteristics may skew the results.
Second, we must utilize other data mining techniques to help back up such findings. One may run the training and testing data through alternative techniques, such as decision trees or rule generation, to quantify the relationships and interactions in the data. Thus, if ludicrous or overly obvious connections arise, they will be displayed in a meaningful way for intelligent humans to comment on. Finally, sensitivity analysis can be utilized to show which input variables are most important and how variable they are (WWW4).

B. They "are not guaranteed to converge to an optimal solution": While solutions can be obtained using a variety of constraints, weights, and inputs, it is often difficult to ensure that the goal of a globally optimal solution is ever met. More so, it may be quite impractical and unrealistic to expect a limited neural network to spit out a perfect solution via an imperfect process. An example of this could be found in a stock market problem. When trying to calculate the precise future value of a publicly traded security, many input variables and factors are necessary for a complete data set. A variety of financial measures, market conditions, forecasting data, marketing strategies, cost studies, and even investor profiles can be used to predict future stock prices. As the sheer number and complexity of the input variables increase, the network may come to comprise "dozens of neurons with a couple hundred connections between them" (WWW6). Thus, "the number of degrees of freedom of the created forecasting model (these are weights of all connections between the network neurons) often becomes larger than the number of examples (separate data records) that had been used to train the network." Hence, the fluctuation and interactions of so many variables may prevent an optimal solution from occurring – practically and theoretically. To mitigate these problems we can use a number of techniques.
First, we must again consider the use of sensitivity analysis in making sense of our results. Such analysis would allow us to better understand the "rank ordering for the relative importance of individual attributes" (255). Second, we could create a more manageable test set that focuses on these attributes. By limiting the number of representative instances we can slow the combinatorial explosion of interactions in the model. Finally, we may obtain the most workable solution by satisficing. With a large number of variables and interactions, convergence to a particular solution may be impractical (and not very meaningful). By putting a limiting value – such as a set number of iterations and passes, or a "good enough" node output score – on the network, we may actually end up with a better answer.

C. They can be "over trained to the point of working well on the training data but poorly on test data": One of the great perils of neural networks is that the network may be finely tuned to the training data but work poorly on the test data. This can be attributed to more than just poor instance choices. Neural networks are only as good as their training data, and training data may be limited in the real world. More so, small changes in the training set or weights – even due to random changes – can cause drastic changes in the output (WWW7). An example of this can be found in a small, new house-loan business. Such an organization, even run by experienced individuals, might want to use neural networks to see which clients would be likely to default on or pay off loans. Based on its small client base (N = 10), such a network would be limited and some input variables could dominate the model. As more clients are fed into the network it will evolve, and the older outputs will change in ways that are not totally predictable. Thus, each client may make the model better or (as an outlier) destroy it. To mitigate these problems we can "constantly measure test set performance" (257).
Once again, sensitivity analysis would give us a better idea of how the model changes when one or more entries are added to the input. Second, if a model is generally unstable, statistical techniques to dampen the noise, or algorithms like "bagging" or the sigmoid function, can smooth out problems. Finally, the training data can be altered to better reflect the true nature of the sample. Another option is running different, equally sized samples from the population and averaging the values to minimize error.

WWW4 = http://www.netnam.vn/unescocourse/knowlegde/65.htm
WWW5 = http://ailab.ch/teaching/classes/2004ss/nn/lecture1.pdf
WWW6 = http://www.megaputer.com/dm/systems.php3
WWW7 = http://www.predictiondynamics.com/education/es1999-3.pdf
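The "constantly measure test set performance" advice from answer 6C can be sketched with a toy model. Here a nearest-neighbor regressor (a stand-in for a neural network, chosen only because it makes memorization easy to see) is tuned by monitoring held-out error rather than training error; the data, the k-NN model, and all parameters are hypothetical assumptions:

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical noisy 1-D data: y = 2x + Gaussian noise
def make_set(n):
    return [(x, 2 * x + rng.gauss(0, 2))
            for x in (rng.uniform(0, 10) for _ in range(n))]

train, test = make_set(40), make_set(40)

def knn_predict(x, data, k):
    """Predict y as the mean of the k nearest training points."""
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)

def mse(data_eval, data_fit, k):
    """Mean squared error of k-NN fit on data_fit, scored on data_eval."""
    return mean((knn_predict(x, data_fit, k) - y) ** 2 for x, y in data_eval)

# At k=1 the model memorizes: train error is exactly 0, yet the noise
# is memorized too, so test error suffers -- the overtraining trap.
print("k=1 train error:", mse(train, train, 1))

# Monitoring *test* error instead picks a flexibility level that
# generalizes, never worse on test data than the memorizing model.
best_k = min(range(1, 20), key=lambda k: mse(test, train, k))
print("chosen k:", best_k, "test error:", round(mse(test, train, best_k), 2))
```

For a real neural network the same loop becomes early stopping: track test-set error each training epoch and keep the weights from the epoch where it was lowest.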