Monica Nusskern
CSIS 5420 – Data Mining
Week 9 Final Exam
July 26, 2005
Total Score 130 out of 130
Score 50 out of 50
1. (50 Points) Data pre-processing and conditioning is one of the key
factors that determine whether a data mining project will be a success. For
each of the following topics, describe the effect this issue can have on our
data mining session and what techniques we can use to counter the
problem.
a. Noisy data – When mining a database, many attribute values can be
erroneous, with human error being one of many possible causes. Training set
values may differ from what they should have been, which can leave database
fields contradicting the specified database rules. Noisy values may cause a
system either to ignore the data or to take these values into account and
possibly change them to what it believes is the correct pattern. A major issue is
that noisy values are hard to find within a large database. Techniques that can
be used to counter noisy data include: data preprocessing, regression
(smoothing data values), removing outliers, and combining computer and
human inspection. Good
b. Missing data – Data is not always available when data mining a large
database. Missing data can be due to:
• Data wasn’t captured due to an equipment malfunction
• Data is inconsistent with other recorded data, so an application program
may have deleted it
• Data was not entered due to a misunderstanding
• Certain data may not have been considered important at the time of entry
Good
Missing data values then need to be inferred or estimated. Techniques for
handling missing data include:

• Ignore the Missing Data: This may be easy, but it is not effective when the
percentage of missing values per attribute varies considerably. Good
• Fill in the Missing Value Manually: This can be tedious and infeasible.
Good
• Use a Constant to Fill in the Missing Values: For example, use “unknown”
as the value for all missing data. Good
• Use the Attribute Mean to Fill in the Missing Value: The mean can be used
if the attribute is numeric, or the majority value if the attribute is
categorical. Good
• Use the Attribute Mean for all Samples Belonging to the Same Class to Fill
in the Missing Value: This is often an effective means for filling in missing
values. Good
• Use the Most Probable Value to Fill in the Missing Value: Examples
include the Bayesian formula or a decision tree. Good
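The mean-based strategies above can be sketched in a few lines of Python. This is an illustrative sketch (not part of the original exam), using `None` to mark missing values:

```python
# Illustrative sketch: filling in missing numeric values (marked None)
# with the attribute mean, or with the mean of the instance's class.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_class_mean(values, classes):
    """Replace None entries with the mean of observed values in the same class."""
    means = {}
    for c in set(classes):
        obs = [v for v, k in zip(values, classes) if k == c and v is not None]
        means[c] = sum(obs) / len(obs)
    return [means[c] if v is None else v for v, c in zip(values, classes)]

ages = [25, None, 35, None, 45]
labels = ["a", "a", "b", "b", "b"]
print(impute_mean(ages))                # [25, 35.0, 35, 35.0, 45]
print(impute_class_mean(ages, labels))  # [25, 25.0, 35, 40.0, 45]
```

The class-conditional version is usually the more effective of the two, because it uses the output attribute to narrow the estimate.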
c. Data normalization and scaling - Data normalization is a common data
transformation method that converts numeric data values so that they all fall
within a particular range. Neural networks work well when the numerical data
lies between 0 and 1. This method is effective with distance-based classifiers,
because fields with a wide range of values are much less likely to overshadow
attributes with smaller ranges. Good Decimal scaling divides each numerical
value by the same power of 10. For example, if all data values fall between
–100 and 100, this range can be transformed to values between –1 and 1 by
dividing each value by 100. Good
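Both transformations described above are one-liners in practice. A minimal Python sketch (not part of the original exam):

```python
import math

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def decimal_scale(values):
    """Divide every value by the smallest power of 10 mapping all of them into [-1, 1]."""
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

print(min_max_normalize([0, 50, 100]))  # [0.0, 0.5, 1.0]
print(decimal_scale([-100, 50, 100]))   # [-1.0, 0.5, 1.0]
```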
d. Data type conversion - Categorical data cannot be processed by a variety of
data mining tools, such as neural networks. Converting categorical alpha data
to a numeric equivalent is a common data transformation method. Good
Conversely, some techniques cannot use numeric data directly; decision trees,
for example, sort numeric values and then derive split values for the data. Good
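One common categorical-to-numeric conversion is one-hot encoding, which replaces each categorical value with a vector of 0/1 indicators. An illustrative sketch (not from the exam):

```python
def one_hot(values):
    """Encode categorical values as 0/1 indicator vectors, one column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))] for v in values]

# Categories sort to ["blue", "red"], so red -> [0, 1] and blue -> [1, 0].
print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```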
e. Attribute and instance selection - Some data mining techniques are not able
to evaluate data with too few attributes, and others cannot handle data with
large numbers of attributes. Determining which attributes are relevant or
irrelevant is a significant problem in data mining. The number of irrelevant
attributes affects the number of training instances necessary to construct a
supervised learner model. Eliminating attributes works well with neural
networks and nearest neighbor classifiers; however, attribute selection has to
occur before the data mining process can begin. Attributes with little predictive
power may be combined with other attributes to create new attributes with a
high degree of predictive capability. Good Means of completing this task
include:
• Creating a new attribute where each value represents the ratio of the
value of one attribute divided by the value of a second attribute. Good
• Creating a new attribute where values are differences between the
values of two existing attributes. Good
• Creating a new attribute where values are computed as the percent
increase or decrease between two or more current values. Good
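The three construction methods listed above amount to simple column arithmetic. An illustrative Python sketch (the income/debt attributes are made-up examples):

```python
def ratio_attr(a, b):
    """New attribute: value of one attribute divided by a second."""
    return [x / y for x, y in zip(a, b)]

def diff_attr(a, b):
    """New attribute: difference between two existing attributes."""
    return [x - y for x, y in zip(a, b)]

def pct_change_attr(old, new):
    """New attribute: percent increase (positive) or decrease (negative)."""
    return [100.0 * (n - o) / o for o, n in zip(old, new)]

income = [50000, 60000]
debt = [25000, 15000]
print(ratio_attr(income, debt))                 # [2.0, 4.0]
print(diff_attr(income, debt))                  # [25000, 45000]
print(pct_change_attr([100, 200], [110, 150]))  # [10.0, -25.0]
```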
Score 80 out of 80
2. (80 points) We've covered several data mining techniques in this course.
For each technique identified below, describe the technique, identify which
problems it is best suited for, identify which problems it has difficulties
with, and describe any issues or limitations of the technique.
a. Decision trees - A common algorithm for constructing a decision tree first
selects a subset of instances from the training data and builds an initial tree.
The accuracy of this tree is then tested against the remaining training
instances; incorrectly classified instances are added to the current training set
and the process repeats. Decision trees are fairly easy to comprehend and can
be mapped to a distinct set of production rules that can easily be applied to
real-world issues. Decision trees do not make prior assumptions about the
data, and they are able to work with both numerical and categorical data.
Some issues with decision trees are that output attributes must be categorical
and multiple output attributes are not allowed. Even minor differences in
training data can cause different attribute selections at each choice point
within the tree, making the tree unstable. Good
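Attribute selection at each choice point is typically driven by a scoring measure. A minimal sketch using information gain (one common measure, not necessarily the exact one covered in the course):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on one categorical attribute."""
    n = len(labels)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(lab)
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0 (a perfect split)
```

The tree-building algorithm picks the attribute with the highest gain at each node, which is why small changes in the training data can swap the chosen attribute.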
b. Association rules - Association rules assist with discovering unknown
relationships in larger databases. An attribute that is a precondition in one rule
may appear as a consequence in another rule, and association rules permit the
consequence of a rule to contain one or more values. Rule confidence and
support are used to help uncover associations that are likely to be interesting
to marketing departments. A major issue with association rules is that some
discovered relationships may turn out to be trivial, so careful interpretation of
the results is required. Good
c. K-Means algorithm - The K-Means algorithm is among the most popular
clustering techniques in use. It is a statistical, unsupervised clustering method.
It accepts only numeric input attributes, and the user must decide in advance
how many clusters are to be discovered. The K-Means algorithm starts by
randomly choosing one unique data point to represent each cluster; each data
instance is then placed in the cluster to which it is most similar. New cluster
centers are computed, and the process repeats until the cluster centers no
longer change. The K-Means algorithm does not guarantee convergence to an
optimal solution, it lacks the ability to explain what has been formed, and it is
not able to determine which attributes are significant in forming the clusters.
Good
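The assign-and-recompute loop described above can be sketched for one-dimensional data. An illustrative version (initial centers are taken as the first k distinct points rather than chosen randomly, purely for reproducibility):

```python
def k_means(points, k, iterations=100):
    """Iteratively assign points to the nearest center and recompute centers."""
    centers = list(dict.fromkeys(points))[:k]  # first k distinct points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # stop when centers no longer change
            break
        centers = new_centers
    return centers, clusters

centers, clusters = k_means([1, 2, 10, 11], k=2)
print(centers)   # [1.5, 10.5]
print(clusters)  # [[1, 2], [10, 11]]
```

Rerunning with different initial centers can yield a different (locally optimal) clustering, which is the convergence caveat noted above.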
d. Linear regression - Linear regression can be an accurate method for
prediction and estimation problems. It models variation in a dependent variable
as a linear combination of one or more independent variables. It is an effective
data mining technique if the relationship between the dependent and
independent variables is linear, and less so when the outcome is binary in
nature. This is because the equation places no restriction on the value of the
dependent variable, so predictions can fall outside the valid range. Good
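For a single independent variable, the least-squares fit has a simple closed form. A minimal sketch (not from the exam):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line([1, 2, 3], [2, 4, 6])
print(slope, intercept)  # 2.0 0.0
```

Note that nothing keeps `slope * x + intercept` inside [0, 1], which is exactly why the technique struggles with binary outcomes.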
e. Logistic regression – Logistic regression may be considered the complement
of linear regression in that it is often an excellent choice for problems having a
binary outcome. Logistic regression is a nonlinear regression method that
associates a conditional probability value with each data instance. Logistic
regression constrains output values to lie between 0 and 1, so they can
represent a probability of class membership, whereas linear regression places
no such restriction on its outputs. Good
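The constraint to the 0–1 range comes from the logistic (sigmoid) function. An illustrative sketch of scoring one instance (the weights here are arbitrary placeholders, not fitted values):

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, inputs):
    """Conditional probability of class membership for one instance."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

print(sigmoid(0.0))                                      # 0.5
print(predict_probability([2.0, -1.0], 0.5, [1.0, 3.0]))  # sigmoid(-0.5), about 0.378
```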
f. Bayes classifier - The Bayes classifier is a supervised classification method
in which all input attributes are assumed to be of equal importance and
independent of one another. The method can still work effectively even when
these assumptions turn out to be false. A benefit of Bayes classifiers is that
they may be applied to datasets that contain both categorical and numeric
data, as well as datasets with missing information. Good
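For categorical data, the equal-importance and independence assumptions reduce training to counting. A simplified sketch (no smoothing, so an unseen attribute value zeroes out a class; the weather data is made up):

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(label, i)][value] += 1
    return priors, counts

def classify(priors, counts, row):
    """Pick the class maximizing P(class) * product of P(value | class)."""
    total = sum(priors.values())
    best_class, best_score = None, -1.0
    for c, n in priors.items():
        score = n / total
        for i, value in enumerate(row):
            score *= counts[(c, i)][value] / n  # independence assumption
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"], ["rain", "cool"]]
labels = ["no", "no", "yes", "yes"]
priors, counts = train(rows, labels)
print(classify(priors, counts, ["sunny", "mild"]))  # no
```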
g. Neural networks - Neural networks are a group of interconnected nodes
designed to imitate the functioning of the human brain. Neural networks can
be used for both supervised learning and unsupervised clustering, which is a
unique characteristic; however, the input values must always be numeric. In
the learning phase, input values are associated with each instance entering
the network at the input layer, with one input layer node for each input
attribute in the data. Using the input values and the network's connection
weights, the network computes an output for each instance, which is
compared with the desired network output. Training is complete after a
specified number of iterations or when the network reaches a predetermined
minimum error rate. In the second phase, the network weights are fixed, and
the network uses them to compute output values for new instances. Good
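The second phase described above (weights fixed, outputs computed for new instances) can be sketched as a feed-forward pass through one hidden layer. The weights below are arbitrary placeholders, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """Feed-forward pass: input layer -> one hidden layer -> one output node."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Two input nodes (one per input attribute), two hidden nodes, one output node.
out = forward([1.0, 0.0],
              hidden_weights=[[0.5, 0.5], [0.5, 0.5]],
              output_weights=[1.0, 1.0])
print(round(out, 3))  # 0.776
```

During the learning phase, the difference between this output and the desired output is what drives the weight adjustments.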
h. Genetic algorithms - Genetic algorithms apply the theory of evolution to
inductive learning. They can be used as a supervised or an unsupervised
technique, and are usually applied to problems that cannot be solved with
more traditional methods. A fitness function is applied to a set of data
elements to determine which elements survive from one generation to the
next. New instances are created from the elements that do not survive, in
order to replace the deleted elements. Good
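The survive-and-replace loop described above can be sketched with a toy problem. The halving-plus-mutation scheme here is a made-up minimal example, not a canonical genetic algorithm (it omits crossover):

```python
import random

def evolve(population, fitness, generations=50, seed=0):
    """Keep the fitter half each generation; replace the rest with mutated survivors."""
    rng = random.Random(seed)
    for _ in range(generations):
        population = sorted(population, key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        children = [s + rng.choice([-1, 1]) for s in survivors]  # simple mutation
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness function: integers closer to 7 are fitter.
best = evolve([0, 1, 2, 20], fitness=lambda x: -(x - 7) ** 2)
print(best)
```

Because survivors are carried forward unchanged, the best fitness found never decreases from one generation to the next.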
Very good overview!