Total Score 120 out of 120
Karen Cegelski
Mid Term Week 5
CSIS 5420
June 30, 2005
Score 20 out of 20
1. In a supervised learner model we use a training dataset and a
test dataset. The training dataset is used to train and refine the
model. The test dataset represents any data we will submit to the
model for classification. What constraints must we require that the
training dataset conform to so that any test data will be classified
correctly?
The cause of poor test set accuracy may lie in problems with the training data. If
the model is built with training data that does not contain all possible domain
instances, or that contains an abundance of atypical instances, then it is not
likely to perform well. Randomly selecting training data while ensuring that the
classes contained in the training data are distributed as they would be seen in
the general population is the best preventative method for ensuring test set
accuracy. This procedure is known as stratification. Another useful technique
involves careful examination of training instance typicality scores. Good
Training set data must be proportionate to the test set data: models built with
training data that does not represent the set of all possible domain instances can
lead to misclassified test data (215).
Redundant attributes must be addressed: redundant attributes can certainly wreak
havoc upon output results (228).
Missing values must be addressed: in the data preprocessing phase, decisions
need to be made as to how to handle missing or null attribute values on an
entry-by-entry basis (155).
Input must be meaningful and errors minimized: we have a variety of error-checking
and process-simplifying mechanisms at our disposal (129).
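
A minimal sketch of the stratified selection described above, assuming Python with scikit-learn installed; the data here are synthetic placeholders, not anything from the text.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 instances, 4 numeric attributes, and an
# imbalanced class label (80% class 0, 20% class 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class proportions in both the training and
# test splits, which is the stratification idea described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())  # both close to 0.20

With the split stratified this way, the training data mirrors the class distribution the model will see in the general population.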
Score 20 out of 20
2. In unsupervised clustering we adjust, or the data-mining tool
adjusts, a cluster membership fitness threshold value. This value
determines whether a piece of data will belong, or not belong, to a
particular cluster. Tuning this value adjusts the number of clusters a
model will partition the training data into. Why is it important to tune
the number of clusters that our unsupervised clustering model will
create?
Without a dependent variable to guide the learning process, we
must rely on the learning program to build a knowledge structure by
using some measure of cluster quality to group instances into two
or more classes. The primary goal of unsupervised clustering is to
discover concept structure in the data. Common uses of
unsupervised clustering include:
- Determine if meaningful relationships in the form of concepts can be found in the data Good
- Evaluate the likely performance of a supervised learner model Good
- Determine a best set of input attributes for supervised learning Good
- Detect outliers. Good
Varying the number of clusters can assist in detecting any atypical
instances present in the data by examining those instances that do
not group naturally with other instances. Good. The main goal of
unsupervised clustering is to create K clusters where the
entries within each cluster are very similar, but the clusters are
different from one another.
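
A minimal sketch of tuning the number of clusters, assuming Python with scikit-learn; the data are synthetic, and the silhouette score stands in for whatever cluster-quality measure a given tool exposes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three natural groupings (hypothetical example).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(50, 2))
               for center in ((0, 0), (5, 5), (0, 5))])

# Try several values of K and compare a cluster-quality measure; a poor
# choice of K either merges distinct concepts or splits natural ones.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# The K with the best score (here K=3) is the most defensible partition.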
Score 20 out of 20
3. The K-Means algorithm has some issues that can cause poor or
incorrect results. What are these issues? What data preprocessing
can we use to either detect or avoid each issue?
This method is easy to understand and implement. However, several issues must
be considered:
- The algorithm only works with real-valued data. Categorical attributes in the data set must either be discarded or converted to a numerical equivalent (a sketch of this conversion follows the list). Good
- A value for the number of clusters to be formed is required. Making a poor choice is an obvious problem. Running the algorithm several times using different values for K can assist in determining how many clusters may be present in the data. Good
- The K-Means algorithm works best when all the clusters that exist in the data set are approximately the same size. Clusters of unequal size mean that the K-Means algorithm may not find the best solution. Good
- There is no way to determine which attributes are significant in determining the formed clusters. Irrelevant attributes can cause less than optimal results. Good
- We are responsible for interpreting what has been found, due to the lack of explanation about the nature of the formed clusters. Supervised data mining tools can help gain insight into the nature of the clusters formed by unsupervised clustering algorithms. Good
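
A minimal sketch of the first preprocessing step above, converting a categorical attribute to a numerical equivalent (and standardizing) before running K-Means; assumes Python with pandas and scikit-learn, and the column names are hypothetical.

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical instances with one numeric and one categorical attribute.
df = pd.DataFrame({
    "income": [25.0, 27.5, 90.0, 95.0, 26.0, 92.5],
    "region": ["east", "east", "west", "west", "east", "west"],
})

# K-Means only works with real-valued data, so the categorical attribute
# is converted into 0/1 indicator columns (one-hot encoding).
X = pd.get_dummies(df, columns=["region"]).astype(float)

# Standardizing keeps an attribute with a large range from dominating
# the distance calculation.
X = (X - X.mean()) / X.std()

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)

Running the same code with several values of n_clusters, as suggested above, helps judge how many clusters are actually present.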
Score 20 out of 20
4. Genetic algorithms are powerful tools but they have some issues
that affect their results or their performance. What are these
issues? What methodologies can we use to either mitigate or avoid
each issue?
Genetic algorithms are gaining in popularity; however, few commercial data
mining products contain this component. They can be used for both supervised
and unsupervised learning and are typically used in situations where traditional
techniques will not solve the problem. However, this problem-solving approach
has other issues you need to be aware of:
- Genetic algorithms are designed to locate globally optimized solutions. There are no guarantees that any given solution is not the result of a local rather than a global optimization. Properly handling the fitness function can help ensure global optimization. Good
- The fitness function determines the computational complexity of a genetic algorithm. If the fitness function involves several calculations, it can be computationally expensive. Operating on dynamic data sets is difficult, as early convergence toward solutions may no longer be valid for later data. A possible remedy is to increase genetic diversity and attempt to prevent early convergence, either by increasing the probability of mutation when the solution quality drops or by occasionally introducing new, randomly generated elements into the gene pool (a sketch of this appears after the list). Good
- GAs cannot effectively solve problems in which there is no way to judge the fitness of an answer other than right/wrong, as there is no way to converge on the solution. Good
- Transforming the data to a form suitable for a genetic algorithm can be challenging. Good
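
A minimal genetic-algorithm sketch in plain Python illustrating the diversity remedy described above: the mutation probability is raised whenever the best fitness stops improving. The fitness function (counting 1-bits) is only a stand-in for a real problem.

import random

random.seed(0)
GENES, POP, GENERATIONS = 30, 20, 60

def fitness(chrom):
    # Stand-in fitness function: the number of 1-bits in the chromosome.
    return sum(chrom)

def crossover(a, b):
    point = random.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(chrom, p):
    return [1 - gene if random.random() < p else gene for gene in chrom]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
mutation_p, best_so_far = 0.01, -1

for generation in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    best = fitness(population[0])

    # Diversity remedy: if the best solution has stopped improving, raise
    # the mutation probability to fight premature (local) convergence.
    mutation_p = 0.01 if best > best_so_far else min(0.2, mutation_p * 2)
    best_so_far = max(best_so_far, best)

    # Keep the top half and refill the rest with mutated offspring.
    survivors = population[: POP // 2]
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)), mutation_p)
                for _ in range(POP - len(survivors))]
    population = survivors + children

print("best fitness:", fitness(max(population, key=fitness)), "of", GENES)

The same idea can be implemented by occasionally injecting new, randomly generated chromosomes instead of (or in addition to) raising the mutation rate.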
Score 20 out of 20
5. The quality of a data mining model is highly dependent on the
data used to train or develop the model. Discuss the techniques
used to preprocess our data prior to data mining.
Data preprocessing begins with data cleaning. This involves
accounting for noisy data and dealing with missing information.
Ideally data preprocessing would take place prior to the data being
stored permanently in the data warehouse. Good
Noisy data refers to random errors in the attribute values. In large
datasets, noise can be found in many shapes and forms. These
concerns include: 1) How do we find duplicate records?; 2) How
can we locate incorrect attribute values?; 3) What data smoothing
operations should be applied to the data?; and 4) How can we find
and process outliers? Good
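
A minimal sketch of two of the noise checks above, finding duplicate records and flagging outliers, assuming Python with pandas; the column names and the interquartile-range rule are illustrative choices, not from the text.

import pandas as pd

# Hypothetical records with one duplicate row and one wildly atypical value.
df = pd.DataFrame({
    "id":     [1, 2, 2, 3, 4],
    "age":    [34, 41, 41, 29, 250],   # 250 looks like a data-entry error
    "salary": [52000, 61000, 61000, 48000, 57000],
})

# 1) Duplicate records: identical rows are flagged and then dropped.
print(df[df.duplicated()])
df = df.drop_duplicates()

# 2) Outliers: flag ages outside 1.5 interquartile ranges of the middle
#    half of the data (a common rule of thumb; domain knowledge decides
#    whether to correct, remove, or keep the flagged value).
q1, q3 = df["age"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
print(df[(df["age"] < q1 - fence) | (df["age"] > q3 + fence)])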
Missing data items can be dealt with in various ways. Frequently,
missing attribute values indicate the information is lost. Most data
mining techniques require that all attributes contain a value. Some
solutions for dealing with missing data prior to it being presented
to the data mining algorithm are: 1) Discard records with missing
data – this should only be used when a small percentage of the
total number of instances contain missing data; 2) For real-valued
data, replace missing values with the class mean – this is an option
for numerical attributes; and 3) Replace missing attribute values
with the values found within other highly similar instances – this
method can be used for either categorical or numeric attributes.
Good
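
A minimal sketch of the first two options above (discarding incomplete records and replacing a missing numeric value with its class mean), assuming Python with pandas; the column names are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical instances; NaN marks a missing attribute value.
df = pd.DataFrame({
    "class":  ["yes", "yes", "no", "no", "yes"],
    "income": [42.0, np.nan, 55.0, 61.0, 44.0],
})

# Option 1: discard records with missing data (only reasonable when a
# small percentage of the instances are affected).
discarded = df.dropna()

# Option 2: for real-valued data, replace each missing value with the
# mean of the instance's own class.
class_mean = df.groupby("class")["income"].transform("mean")
df["income"] = df["income"].fillna(class_mean)
print(df)   # the NaN in class "yes" becomes the class mean, 43.0

Option 3, replacing missing values from highly similar instances, would use a nearest-neighbor lookup instead of the class mean.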
Some techniques allow for instances to contain missing values.
Three ways that these techniques deal with the missing data are: 1)
Ignore the missing values – neural networks and the Bayes classifier
use this approach; 2) Treat missing values as equal comparisons –
this approach can present problems with very noisy data, in that
instances may appear to be alike when in reality they are not; 3)
Treat missing values as unequal comparisons – in this case two
similar instances containing missing values may appear dissimilar.
Good
Score 20 out of 20
6. Neural networks have three significant weaknesses. What are
they? Give an example that demonstrates each weakness. How
can we mitigate each weakness?
- One of the biggest criticisms of neural networks is that they lack the ability to explain their behavior. In areas where explaining rules is important, such as denying loan applications, neural networks should not be the tool of choice. Neural networks should be used when acting on the results is more important than understanding them. Good
- The learning algorithms associated with neural networks are not guaranteed to converge to an optimal solution. This problem can be dealt with by manipulating various learning parameters. The inputs to a neural network can be massaged to be in a particular range, usually between 0 and 1. Sometimes this choice can have an effect on the results. Massaging the data requires analyzing the training set to verify the data values and their ranges. Since data quality is the number one issue in data mining, the necessity of this step can possibly prevent problems later in the analysis. Good
- Neural networks can be overtrained to the point where they work well on training data but poorly on test data. Test set performance must be consistently monitored in order to watch for this problem (a sketch of this monitoring follows the list). Good
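
A minimal sketch tying the last two points together, assuming Python with scikit-learn: the inputs are massaged into the 0 to 1 range, and performance on held-out data is monitored so training stops before the network is overtrained. The data and network size are hypothetical.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 300 instances, 5 numeric attributes, a binary class.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Massage the inputs into the 0-1 range (fit the scaler on training data only).
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# early_stopping holds out part of the training data and stops training
# when performance on that held-out portion stops improving, guarding
# against the overtraining problem described above.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    early_stopping=True, random_state=0)
net.fit(X_train, y_train)

print("training accuracy:", round(net.score(X_train, y_train), 3))
print("test accuracy:    ", round(net.score(X_test, y_test), 3))

If training accuracy keeps climbing while the held-out accuracy falls, the network is being overtrained and training should stop.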