CSIS 5420 Mid-term Exam – Answers
1. In a supervised learner model we use a training dataset and a test dataset. The training
dataset is used to train and refine the model. The test dataset represents any data we will
submit to the model for classification. What constraints must we require that the training
dataset conform to so that any test data will be classified correctly?
A. Training set data must be proportionate to the test set data: we have seen that “models built with
training data that does not represent the set of all possible domain instances” (215) can lead to
misclassified test data. Especially when the training data is spotty in nature, or when it comes from a tiny
proportion of the population, certain parts of the population may be misrepresented. For instance, if
25/1000 known instances are used for training there is a good chance that the few outliers that exist may
be missed. Despite their small numbers these outliers have great value and should not be stricken from
the training data without just cause (i.e., without addressing them another way). Ideally, the training set
should be a smaller yet characteristically proportionate version of the test data.
B. Redundant attributes must be addressed: redundant attributes can certainly wreak havoc upon output
results. Redundant data, whether positively (r = 1.0) or negatively (r = -1.0) correlated, can act as a weight
in the mining technique. Thus, the weighted values will skew the model and the test set data may not
conform to it. Fortunately, there are a number of ways of locating, determining, and removing redundant
data: correlation coefficients, scatter plot diagrams, etc. (228).
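As a rough illustration of the correlation-coefficient check (a sketch only, with made-up attribute names
and an arbitrary 0.95 threshold), in Python:

    import numpy as np

    # Hypothetical numeric training attributes (columns): age, salary, salary_in_cents.
    # The third column is a deliberately redundant copy of the second (r = 1.0).
    names = ["age", "salary", "salary_in_cents"]
    data = np.array([
        [25, 40000, 4000000],
        [37, 52000, 5200000],
        [45, 61000, 6100000],
        [52, 75000, 7500000],
    ], dtype=float)

    # Pairwise correlation matrix of the attribute columns.
    r = np.corrcoef(data, rowvar=False)

    # Flag pairs whose |r| exceeds a chosen redundancy threshold (0.95 is arbitrary).
    threshold = 0.95
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                print(f"{names[i]} and {names[j]} look redundant (r = {r[i, j]:+.2f})")

A scatter plot of any flagged pair would serve as the visual confirmation mentioned above.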
C. Missing values must be addressed: in the data preprocessing phase decisions need to be made as to
how to handle missing or null attribute values on an entry-by-entry basis. If the variables are insignificant
to us (statistically or otherwise) we may choose to remove them from the training data or make them read
only. However, for null values of important input attributes (e.g., salary in a salary survey) a 0 value would
skew the model and its computations. Thus, we must decide whether to ignore the values, enter mean
values, treat them as similar to other instances, or treat them as a special case on their own (155).
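A minimal sketch of three of those options (ignore, fill with the mean, or flag as a special case), using a
made-up salary attribute:

    import numpy as np

    # Hypothetical salary attribute with missing (NaN) entries.
    salary = np.array([48000.0, np.nan, 61000.0, 52000.0, np.nan])

    # Option 1: ignore the missing entries when computing statistics.
    mean_salary = np.nanmean(salary)

    # Option 2: replace missing values with the mean of the observed values.
    filled = np.where(np.isnan(salary), mean_salary, salary)

    # Option 3: keep a separate indicator so the model can treat them as a special case.
    is_missing = np.isnan(salary).astype(int)

    print(mean_salary, filled, is_missing)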
D. Input must be meaningful and error minimized: we have a variety of error checking and process
simplifying mechanisms at our disposal. First, if input data is not meaningful for classifying the test data
we can remove it from the model. Second, we can judge error using tools like a confusion matrix to see
where misclassification occurs. It may be of great importance to keep Type 2 errors (instances that should
have been rejected but were accepted) below a threshold level, while Type 1 errors may be more
tolerable. Third, we may use the 95% confidence interval on the error to iteratively adjust the test set
data to reduce the size of the interval or the mean value. We must note, however, that unless the test set
contains more than 100 instances the error estimate may need to be taken with a grain of salt (129).
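A minimal sketch of the confusion matrix and a normal-approximation 95% confidence interval on the
test-set error rate; the counts are made up, and the row/column layout is one possible convention:

    import math

    # Hypothetical 2x2 confusion matrix: rows = actual, columns = predicted (accept, reject).
    confusion = [
        [85, 5],   # actually accept: 85 correct, 5 wrongly rejected (Type 1 errors)
        [8, 102],  # actually reject: 8 wrongly accepted (Type 2 errors), 102 correct
    ]

    n = sum(sum(row) for row in confusion)
    errors = confusion[0][1] + confusion[1][0]
    error_rate = errors / n

    # Normal-approximation 95% confidence interval for the test-set error rate.
    se = math.sqrt(error_rate * (1 - error_rate) / n)
    low, high = error_rate - 1.96 * se, error_rate + 1.96 * se
    print(f"n={n}, error={error_rate:.3f}, 95% CI = ({low:.3f}, {high:.3f})")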
2. In unsupervised clustering we adjust, or the data mining tool adjusts, a cluster
membership fitness threshold value. This value determines whether a piece of data will
belong, or not belong, to a particular cluster. Tuning this value adjusts the number of
clusters a model will partition the training data into. Why is it important to tune the number
of clusters that our unsupervised clustering model will create?
In unsupervised clustering we adjust, or the data mining tool adjusts, a cluster membership fitness
threshold value. However, it is incredibly important for us to responsibly tune and estimate the number of
clusters or else the data will be open to a variety of misclassifications.
The main goal of unsupervised clustering is to create K clusters where the entries within each cluster are
very similar, but the clusters are different from one another (WWW11). However, one of the drawbacks of
unsupervised clustering is that we must supply an estimate of the number of clusters that we want our final
output to conform to. With predetermined Boolean outputs (yes/no, male/female, sick/not sick) our choice
is made for us. In more complex examples, especially when dealing with continuous or multi-valued
categorical data, a poor estimate could be disastrous. If we choose too many clusters for another
data mining technique, like the K-means algorithm, the clusters will be so segmented that they lose
meaning. If we choose too few, dissimilar entries will be grouped together.
Another drawback is that programs like IDA/ESX allow you to manipulate the clustering power without a
sense of the optimal number of clusters. “Instance similarity” (and to a lesser degree the “Real-valued
tolerance setting”) can be manipulated to obtain fewer clusters (0) or more clusters (100). If we obtain
more clusters than we need, we also run the risk of producing N singleton clusters (each cluster
containing one entry). If we obtain fewer clusters than we need, distinct characteristics will lose their value.
We can, however, put more constraints on this unsupervised process. First, we may try some alternative
cluster estimates or multiple tries in IDA with varying instance similarities. A common technique is to
begin at K = 10 clusters and adjust accordingly from there when doing K-means algorithms. For
unsupervised clustering we can begin at Instance Similarity = 50 and adjust as needed after viewing basic
statistics on the trials. Second, we may put limiting conditions on what passes for a cluster. Such
conditions could include: a minimum distance, a cap on the number of passes/tries, a minimum cluster
size (if a cluster is too small it is thrown out), a measure of how close points are within a cluster, a
measure of how different points from different clusters are, etc., or simply varying the number of tries
(WWW12). A small sketch of this tuning process appears after the references below.
WWW11 = http://www.palantir.swarthmore.edu/loicz/help/clustering.htm
WWW12 = http://www.geo.utep.edu/pub/seeley/Classification_tutorial_multispec.html
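As a rough illustration of the tuning process described above, the following sketch tries several cluster
counts and rejects settings that violate a minimum-cluster-size condition. It assumes scikit-learn's
KMeans as a stand-in for a tool like IDA/ESX; the two-dimensional data and size threshold are made up:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical two-dimensional data with three natural groups.
    X = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(40, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(40, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(40, 2)),
    ])

    MIN_CLUSTER_SIZE = 5  # limiting condition: throw out clusterings with tiny clusters

    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sizes = np.bincount(km.labels_, minlength=k)
        if sizes.min() < MIN_CLUSTER_SIZE:
            print(f"K={k}: rejected (smallest cluster has {sizes.min()} points)")
            continue
        # inertia_ is the within-cluster sum of squared distances; smaller is tighter.
        print(f"K={k}: within-cluster scatter = {km.inertia_:.1f}, sizes = {sizes.tolist()}")

Viewing the scatter and cluster sizes for each trial plays the role of the “basic statistics on the trials”
mentioned above.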
3. The K-Means algorithm has some issues that can cause poor or incorrect results. What
are these issues? What data preprocessing can we use to either detect or avoid each
issue?
The K-means algorithm certainly has some issues that can cause poor, incorrect, or disappointing results.
A sample of such issues (and their solutions) is below:
A. The number of clusters must be estimated in Step 1 of the algorithm (88) and this is not always
possible. By underestimating we create unnatural clusters that do not represent something real; by
overestimating we weaken the power to easily grasp the true relationships. Using unsupervised
clustering, and even unsupervised neural networks, to see what naturally groups together and which
characteristics are strong can solve this; we can base the initial supervised clusters on this pre-testing.
Basic statistics also help to explain the nature of the data set.
B. The initial cluster centers are picked at random and will most likely be suboptimal. Even worse, the
centers may end up being in the same cluster or far removed from the stable center. Thus, while a
stable center is guaranteed, time is wasted on calculations. This can be solved by calculating pairwise
distances between points to locate a handful of remote points, which may then be used as initial cluster
centers (regardless of the number of initial clusters).
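A greedy, hedged sketch of that idea (not the book's exact procedure): pick points that are mutually
remote, using pairwise Euclidean distances, from a tiny made-up data set:

    import numpy as np

    def farthest_point_centers(X, k):
        """Pick k initial centers that are mutually far apart (a greedy sketch)."""
        # Start from the point farthest from the overall mean.
        start = np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))
        centers = [start]
        for _ in range(k - 1):
            # Distance from every point to its nearest already-chosen center.
            d = np.min(
                np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2), axis=1
            )
            centers.append(int(np.argmax(d)))  # add the most remote point
        return X[centers]

    # Tiny made-up data set.
    X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.9, 8.2], [4.0, 0.5]])
    print(farthest_point_centers(X, k=3))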
C. The ability to try and judge alternatives is severely hampered as the number of points and pairs
increases. The combinatorial explosion makes comparing every possible combination of points, and using
different initial centers, virtually impossible. Thus, even the search for better suboptimal solutions is
hampered by time and resource constraints. This can be solved by picking better centers (see B
above), which will eliminate some calculations. One can also put limiting or terminating conditions on the
calculation of alternatives; once the condition is met, such as the minimum squared error falling below a
certain threshold (88), the calculation of alternatives can cease.
D. Our choice of initial cluster centers will be reflected in our final cluster centers. More so, “we may see
a different final cluster configuration for each alternative choice of the initial cluster centers” (87). Thus,
optimal or near-optimal answers become more elusive. Once again, picking better centers to begin with,
by using an initial Euclidean distance measurement and picking remote points as centers, can solve this.
Also, by understanding the data set using graphical plots we may be able to accept that a
suboptimal solution is (and to some extent, should be) acceptable.
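One blunt way to cope with this sensitivity is to run the algorithm from several different random initial
centers and keep the configuration with the lowest sum of squared error. A minimal sketch, assuming
scikit-learn's KMeans and made-up two-cluster data:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal((0, 0), 0.4, (30, 2)), rng.normal((4, 4), 0.4, (30, 2))])

    best = None
    for seed in range(5):
        # n_init=1 so each run keeps its own (possibly poor) random initial centers.
        km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
        print(f"seed {seed}: sum of squared error = {km.inertia_:.2f}")
        if best is None or km.inertia_ < best.inertia_:
            best = km

    print("chosen centers:\n", best.cluster_centers_)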
E. Even outside of the problems of center choice and combinatorial explosion we see that optimal
solutions require alternative computations, which may be impractical. The process of calculating and
recalculating centers and distances can be somewhat numbing as the set size increases. However,
scaling large sets down to smaller, yet equally representative, sets can solve this. This allows
computations and measures of goodness to be done on a more manageable set. While the solution will
technically be suboptimal it can certainly meet satisfying conditions. We must be careful to note that if
this sampling is done the test sets must mimic the whole set or our calculations and estimates will be for
naught.
F. The K-Means algorithm “only works with real valued data” (88). We have two ways of dealing with
categorical data: discard them or convert them. Discarding values strips valuable data from the set and
converting values may be time consuming or create numerical measures that do not properly represent
the data. Running unsupervised clustering to identify important categorical variables can solve this.
Those deemed important, or interesting, can be transformed by assigning random (yet evenly spaced)
values to the categorical attributes. These may be on a scale of (0,1) or done using real numbers.
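A minimal sketch of that conversion; the attribute name, its categories, and the evenly spaced (0, 1)
codes are illustrative assumptions, and for truly unordered categories the imposed ordering remains a
caveat:

    # Convert a categorical attribute to evenly spaced numeric codes on a (0, 1) scale
    # so K-Means can use it.
    def evenly_spaced_codes(categories):
        n = len(categories)
        # Map the i-th category to i / (n - 1), so codes run from 0.0 to 1.0.
        return {c: (i / (n - 1) if n > 1 else 0.0) for i, c in enumerate(categories)}

    income_bracket = ["low", "middle", "high", "very high"]   # made-up attribute
    codes = evenly_spaced_codes(income_bracket)
    print(codes)
    print([codes[v] for v in ["middle", "low", "very high"]])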
G. The K-means algorithm tends to work “best when the clusters that exist in the data are of
approximately equal size” (89). Unbalanced data sets may force calculated centers to be far away from
any data point (from a small data set with at least one outlier) or lumped in with recurring
values (from a large data set with repetitive values). This can be solved by more than one method. First,
weights can be assigned to data points to counterbalance extraordinarily large or small data sets.
Second, one may be able to break large clusters into smaller (yet tightly grouped) clusters whose mean
average value may better approximate the larger cluster’s mean.
H. Neither attribute significance nor the clusters themselves can be fully explained using the K-Means
algorithm. Clusters with stable centers will appear, but how an attribute or initial center choice precisely
affects the final result can be hazy at best. However, using alternative methods to evaluate such choices
can solve this. The use of sensitivity analysis, genetic algorithms, and numerical clustering can measure
(and even mimic) the K-means clustering process. Values obtained from these processes can then be
measured against the K-Means clusters to help give some frame of reference for comparisons.
4. Genetic algorithms are powerful tools but they have some issues that affect their results
or their performance. What are these issues? What methodologies can we use to either
mitigate or avoid each issue?
Genetic algorithms certainly have some issues that can cause poor, incorrect, or disappointing results. A
sample of such issues (and their solutions) is below:
A. They “are designed to find globally optimized solutions” (98); however, it is quite difficult to show or
prove that an optimized solution is not due to local optimization. To ensure that a solution is globally
optimal would require a fair number of tries, each of which narrows the parameters for change (crossover,
mutation, and selection). Using some form of sensitivity analysis and adjusting the use of genetic
techniques can solve this. More so, one can ensure a more globally acceptable answer by eliminating
unnecessary variables prior to introduction to the algorithm.
B. Convergence in a genetic algorithm may be premature (WWW2). In a small data set, or a data set
with repetitive values, the early dominance of a chromosome (or part of one) can easily influence and
skew the algorithmic process. This can be solved through the use of dampening or scaling
transformations of the variables in question to create a fairer and more balanced picture.
C. The length, complexity, and resource drain due to a specific fitness function may prove costly on a
number of levels (98). Due to the random nature of genetic algorithms solutions may also arise that are
incorrect (WWW1). It has been shown that error reduction and, hence, more fit solutions are accomplished
via multiple runs of an algorithm along with incremental adjustments (253). Thus, we must consider
the multiplicity of work needed to reduce such error. Fashioning as elegant an equation as possible,
specifically built to measure the desired piece of knowledge, can solve this. More so, reduction of
insignificant variables and proper coding of categorical variables in the preprocessing stage can ease the
burden on a function.
D. The explanatory power of a genetic algorithm's results is directly related to how understandable they are
(98). It has been shown that a good genetic algorithm must be “run for a very long time or have excellent
design rules in order to develop meaningful and optimal solutions to a problem” (WWW3). A large
problem regarding fitness function answers is that their numerical values may not be inherently
meaningful without context. The genetic algorithm may require some data transformation prior to entry
into the algorithm. When categorical variables are transformed to either a (0,1), discrete, or real valued
scale, the numerical ordering may not be meaningful. That is, if we assign 0 = Male and 1 = Female does it
mean that one is better than the other? We can solve these problems by putting the solution, and
equation, in real world context. A good fitness function must be able to produce a practical answer.
Likewise, the answer must make sense in terms of the input variables. If we get an answer of .25 we
must know whether a higher/lower score is better, or if a close value to a discrete or scaled value is
important.
E. The use of genetic algorithm techniques (crossover, mutation, selection) may be more of an art than a
science. Numerous mutation tries may yield only a handful of viable (even very viable) alternative
chromosomes. Selection, especially in small or limited data sets, can cause dominant features to overrun
the model. Even crossover techniques are combinatorially taxing. The rate of change is also of great
concern: too fast and significant variables may be wrongly kicked out; too slow and resources will
be wasted (WWW2). The solution to this problem is experience and insight. Experience will dictate how
much each technique should be used, and insight into the project's limitations (time, money, resources,
goal) will also dictate this.
WWW1 = http://www.cs.bgu.ac.il/~omri/NNUGA/
WWW2 = http://www.talkorigins.org/faqs/genalg/genalg.html#limitations
WWW3 = http://uk.geocities.com/neilcarrott/intro.htm
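To ground the terminology used above (fitness, selection, crossover, mutation, terminating conditions),
here is a minimal genetic algorithm sketch; the target chromosome, rates, population size, and
termination rule are all made up for illustration:

    import random

    random.seed(42)

    TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]          # made-up goal chromosome
    POP_SIZE, GENERATIONS = 20, 50
    CROSSOVER_RATE, MUTATION_RATE = 0.8, 0.02

    def fitness(chrom):
        # Fitness = number of bits matching the target (higher is better).
        return sum(1 for a, b in zip(chrom, TARGET) if a == b)

    def select(pop):
        # Tournament selection: keep the fitter of two random chromosomes.
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    def crossover(p1, p2):
        if random.random() > CROSSOVER_RATE:
            return p1[:], p2[:]
        point = random.randrange(1, len(p1))          # single-point crossover
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    def mutate(chrom):
        return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

    population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP_SIZE)]
    for gen in range(GENERATIONS):
        next_pop = []
        while len(next_pop) < POP_SIZE:
            c1, c2 = crossover(select(population), select(population))
            next_pop.extend([mutate(c1), mutate(c2)])
        population = next_pop[:POP_SIZE]
        if fitness(max(population, key=fitness)) == len(TARGET):   # terminating condition
            break

    print(f"generation {gen}: best fitness {fitness(max(population, key=fitness))}")

Tuning CROSSOVER_RATE and MUTATION_RATE is exactly the “art rather than science” rate-of-change
trade-off described in point E.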
5. The quality of a data mining model is highly dependent on the data used to train or
develop the model. Discuss the techniques used to preprocess our data prior to data
mining.
The quality of a data mining model is highly dependent on the data used to train or develop the model.
While there are a number of concerns and techniques involving data preprocessing, the most prevalent
form is data cleaning (153). We must note that data transformation techniques and data preprocessing
techniques may share some common means. However, as I noted on Week 3’s HW assignment (#1a):
“Transformation involves more of a conversion of data from one form to another via normalization,
algorithms, statistics, or moving between categorical and numerical data. Some attributes may be
selected over other dependent or minor ones, but they still exist. Data cleaning, on the other hand,
"changes" the data by removing duplicates, correcting attribute errors, and dealing properly with nulls (vs.
zero or intentionally empty values). If this did not occur prior to transformation there would be no way to
easily correct these errors (i.e., post normalization or transformation).”
Here is a sampling of the main techniques used to preprocess the data (153-155, WWW8):
A. Data cleaning: this technique is concerned with “accounting for noise and dealing with missing
information” (153). Noisy data suggests that there are errors, duplicate values, powerful outliers, and
inconsistently entered values in the data set. This can be accomplished using programs, histograms,
regression, and human experience (WWW9). Duplicate entries in a database can be costly to a company
or individual; thus, trained individuals (with the help of graphical programs) can help whittle off excess
entries and reduce cost and noise (153). Incorrect attribute values manifest themselves in entries which
contain “discrepancies in codes or names” (WWW9) or data types. Hence, noise would be increased by
a Boolean attribute that contains sentences and short phrases, or a “sex” field with decimal values.
Without data cleaning even basic statistics will include these errors, which many systems can find but
cannot explain. Data smoothing – which is also a data transformation technique – can also be used here
to round numbers or do minor adjustments to the data (i.e., inserting mean values where there are none)
(154). Data cleaning can also be helpful when dealing with missing data. Missing data can greatly alter
statistical measures and even go as far as to prevent proper clustering and rule creation. The three most
common techniques in dealing with null values are: discard missing values (if they are few in number), compare
entries with missing values to similar complete entries, and treat missing values as if they are unique and
non-comparable.
B. Sampling: this technique involves selecting “a representative subset from a large population of data”
(WWW10). Sometimes including all the data from a population into a data mining model is not feasible
due to size, complexity, or statistical significance problems (WWW9). Proper sampling will ensure that
the test data (or data sample) is truly representative of the population as a whole.
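A minimal sketch of one way to keep a sample representative: stratified random sampling, which draws
from each class separately so the subset preserves the population's class mix. The population, labels,
and sampling fraction below are made up:

    import random
    from collections import Counter

    random.seed(0)

    # Hypothetical population: (record_id, class_label) pairs, roughly 90% "no", 10% "yes".
    population = [(i, "yes" if i % 10 == 0 else "no") for i in range(1000)]

    def stratified_sample(records, fraction):
        """Sample each class separately so the subset keeps the population's mix."""
        by_class = {}
        for rec in records:
            by_class.setdefault(rec[1], []).append(rec)
        sample = []
        for label, group in by_class.items():
            sample.extend(random.sample(group, max(1, int(len(group) * fraction))))
        return sample

    sample = stratified_sample(population, fraction=0.05)
    print("population mix:", Counter(label for _, label in population))
    print("sample mix:    ", Counter(label for _, label in sample))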
C. Feature selection: this technique involves paring the data set down to its essential
characteristics. If one is sure certain variables are undesirable or not significant for the given study they
can simply be removed from training and test consideration. By removing such variables the
computations become easier and there are fewer degrees of freedom to account for in statistical
significance. Feature selection also makes it easier to scan the remaining attributes for additional odd
values or cases, which may signal important relationships or changes within the population.
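As a rough illustration of dropping uninformative attributes, the following sketch (made-up attribute
names, arbitrary variance threshold) removes columns that are nearly constant after scaling; a
correlation check like the one sketched under question 1 could be layered on top:

    import numpy as np

    # Hypothetical training attributes (columns): the third is constant in this
    # sample and is a candidate for removal.
    names = ["age", "balance", "branch_code"]
    X = np.array([
        [25, 1200.0, 7],
        [41, 5300.0, 7],
        [33,  800.0, 7],
        [57, 9100.0, 7],
    ], dtype=float)

    # Normalize each column to [0, 1] so variances are comparable, then drop
    # columns whose variance falls below an (arbitrary) threshold.
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    scaled = (X - col_min) / np.where(col_max > col_min, col_max - col_min, 1.0)
    keep = scaled.var(axis=0) > 0.01
    print("kept attributes:", [n for n, k in zip(names, keep) if k])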
D. Mapping: this technique may also nudge itself over into the data transformation category.
Mapping may turn time series or sequential data into a more manageable form (WWW8).
WWW8 = http://www.ims.nus.edu.sg/preprints/2003-15.pdf
WWW9 = http://ir.iit.edu/~nazli/cs422/CS422-Slides/DM-Preprocessing.pdf
WWW10 = http://searchdatabase.techtarget.com/sDefinition/0,,sid13_gci810056,00.html
6. Neural networks have three significant weaknesses. What are they? Give an example that
demonstrates each weakness. How can we mitigate each weakness?
The 3 major weaknesses of neural networks are (257):
A. “They lack the ability to explain their behavior”: While neural networks are quite adept at pattern
recognition they simply are not equipped to explain their choices and output. Neural networks can learn
and become more intelligent via representative training data but they do not have the ability to make
value judgments concerning whether an answer makes real world sense. Especially as hidden layers
increase, the interactions in a network may become “black box”-like and increasingly difficult to explain
(WWW5).
An example of this could be found in a clinical study regarding the prediction of breast cancer patients in
a population. It is quite possible that a model, based on training data, could show that 80% of left handed
women who own cats and have black hair are likely to be breast cancer patients. While in practice this
seems utterly ridiculous to you and me, the neural network cannot differentiate between this pattern and a
more meaningful one (history of cancer in a family, hazardous workplace exposure, etc.).
To mitigate these problems we can use a number of techniques. First, we must ensure the test data (in
size and characteristics) is truly representative of the sample or population as a whole. If the training data
contains a disproportionate percentage of the desired population demographic, then when “normal” test
data is fed into the network, reality will be poorly represented. More so, if the size of the study in general is
not large enough (i.e., 10 women, 4/5 of whom meet the criteria), individual characteristics may skew the
results. Second, we must utilize other data mining techniques to help back up such findings. One may
run such training and testing data through alternative techniques such as decision trees or rule generation
to quantify the relationships and interactions in the data. Thus, if ludicrous or overly obvious connections
arise they will be displayed in a meaningful way for intelligent humans to comment on. Finally, sensitivity
analysis can be utilized to show what input variables are most important and how variable they are
(WWW4).
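A minimal sketch of the sensitivity-analysis idea: perturb one input at a time and observe how much the
output moves. A single sigmoid unit with made-up weights stands in for a trained network, and the
attribute names are hypothetical:

    import numpy as np

    # Stand-in for a trained network's prediction function; in practice this would
    # be the real model's forward pass. The weights are made up so that the second
    # input clearly matters more than the others.
    def model(x):
        return 1 / (1 + np.exp(-(0.1 * x[0] + 2.0 * x[1] + 0.01 * x[2])))

    baseline = np.array([0.5, 0.5, 0.5])
    base_out = model(baseline)

    # Perturb one input at a time and record how much the output moves.
    for i, name in enumerate(["attr_a", "attr_b", "attr_c"]):   # made-up attribute names
        bumped = baseline.copy()
        bumped[i] += 0.1
        print(f"{name}: output change = {model(bumped) - base_out:+.4f}")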
B. They “are not guaranteed to converge to an optimal solution”: While solutions can be obtained using
a variety of constraints, weights, and inputs it is often difficult to ensure that the goal of a globally optimal
solution is ever met. More so, it may be quite impractical and unrealistic to expect a limited neural
network to spit out a perfect solution via an imperfect process.
An example of this could be found in a stock market problem. When trying to calculate the precise future
value of a publicly traded security many input variables and factors are necessary for a complete data set.
A variety of financial measures, market conditions, forecasting data, marketing strategies, cost studies,
and even investor profiles can be used to predict future stock prices. As the sheer number and
complexity of the input variables increase the network may be comprised of “dozens of neurons with a
couple hundred connections between them” (WWW6). Thus, “the number of degrees of freedom of the
created forecasting model (these are weights of all connections between the network neurons) often
becomes larger than the number of examples (separate data records) that had been used to train the
network.” Hence, the fluctuation and interactions of so many variables may prevent an optimal solution
from occurring – practically and theoretically.
To mitigate these problems we can use a number of techniques. First, we must again consider the use of
sensitivity analysis in making sense of our results. Such analysis would allow us to better understand the
“rank ordering for the relative importance of individual attributes” (255). Second, we could create a more
manageable test set that focuses on these attributes. By limiting the number of representative instances
we can slow the combinatorial explosion of interactions in the model. Finally, we may obtain the most
optimal solution by satisficing. With a large number of variables and interactions, convergence to a certain
solution may be impractical (and not very meaningful). By putting a limiting value – such as a set number
of iterations and passes or a “good enough” node output score – on the network we may actually end up
with a better answer.
C. They can be “over trained to the point of working well on the training data but poorly on test data”:
One of the great perils of neural networks is that the network may be finely tuned to the training data but
work poorly on the test data. This can be attributed to more than just poor instance choices. Neural
networks are only as good as their training data, and training data may be limited in the real world. More
so, small changes in the training set or weights – even due to random changes – can cause drastic
changes (WWW7).
An example of this can be found in a small, new house loan business. Such an organization, even run by
experienced individuals, might want to use neural networks to see what clients would be likely to default
or pay off loans. Based on their small client base (N=10) such a network would be limited and some input
variables could dominate the model. As more clients are fed into the network it will evolve and the older
outputs change in ways that are not totally predictable. Thus, each client may make the model better or
(as an outlier) destroy it.
To mitigate these problems we can “constantly measure test set performance” (257). Once again,
sensitivity analysis would give us a better idea how a model changes by adding one or more entries to the
input. Second, if a model is generally unstable, statistical techniques to dampen the noise, or algorithms
like “bagging” or a sigmoid function, can be used to smooth out problems. Finally, the training data can be
altered to better reflect the true nature of the sample. Another option is running different equally sized samples from
the population and averaging the values to minimize error.
WWW4 = http://www.netnam.vn/unescocourse/knowlegde/65.htm
WWW5 = http://ailab.ch/teaching/classes/2004ss/nn/lecture1.pdf
WWW6 = http://www.megaputer.com/dm/systems.php3
WWW7 = http://www.predictiondynamics.com/education/es1999-3.pdf
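As a closing sketch of “constantly measuring test set performance” (257): a single sigmoid unit trained by
gradient descent stands in for a full network, and training stops once the held-out error stops improving.
The data, learning rate, and patience value are all made up:

    import numpy as np

    rng = np.random.default_rng(3)

    # Made-up binary classification data, split into training and held-out test sets.
    X = rng.normal(size=(200, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200) > 0).astype(float)
    X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

    w = np.zeros(3)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    best_test_err, patience = 1.0, 0

    for epoch in range(200):
        # One gradient-descent step on the training set (a single sigmoid unit).
        p = sigmoid(X_train @ w)
        w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

        # Constantly measure test-set performance, as the answer above suggests.
        test_err = np.mean((sigmoid(X_test @ w) > 0.5) != y_test)
        if test_err < best_test_err:
            best_test_err, patience = test_err, 0
        else:
            patience += 1
        if patience >= 10:       # stop once the test error has not improved for a while
            break

    print(f"stopped at epoch {epoch}, best held-out error = {best_test_err:.3f}")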