CIS 498 Exercise for Salary Census Data (U.S. Only)
Part I: Prepare the data source, mining structure, and mining models
Use Business Intelligence Development Studio (BIDS) for this exercise.
Using the Salary Census Training data, start by creating a Data Source View that is based
on a named query focusing only on people whose native country is the United States.
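As a rough illustration of what the named query does, the sketch below runs an equivalent filter against a small in-memory SQLite table. The table name, the column names, and the literal 'United-States' are assumptions made for this example; check the actual training table for the real names and spelling before writing the named query.

```python
import sqlite3

# Hypothetical table and column names; the real training table may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census_training (id INTEGER PRIMARY KEY, age INTEGER, "
    "native_country TEXT, salary TEXT)"
)
conn.executemany(
    "INSERT INTO census_training (age, native_country, salary) VALUES (?, ?, ?)",
    [
        (39, "United-States", "<=50K"),
        (50, "United-States", ">50K"),
        (28, "Cuba", "<=50K"),
        (37, "United-States", "<=50K"),
    ],
)

# This SELECT plays the role of the named query: keep U.S. rows only.
NAMED_QUERY = (
    "SELECT id, age, native_country, salary "
    "FROM census_training "
    "WHERE native_country = 'United-States'"
)
us_rows = conn.execute(NAMED_QUERY).fetchall()
print(len(us_rows), "U.S. rows kept")
```

Only the SELECT statement would go into the named query; the rest of the script exists just to make the example runnable.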
Then create a mining structure with the following mining models: Naïve Bayes, Decision
Tree, and Clustering. Assume that all attributes (other than the key) are used for input,
except for the Salary attribute, which should be Predict Only in all models. Note: some
models will ignore some of the input attributes.
You should notice that several attributes are ignored in the default models. This is
because they are continuous, and the models do not handle continuous variables. So,
change the age, capital gains, capital loss, and hours per week attributes to Discretized,
and set their bucket counts to 4, with discretization method set to "equal areas".
After you do this, go into each mining model and change the usage of these four
attributes from "Ignore" to "Input".
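The "equal areas" setting with 4 buckets roughly corresponds to equal-frequency binning: cut points are chosen so that each bucket holds about the same number of cases. The sketch below shows that idea on made-up ages. The function names are hypothetical, and the exact bucket boundaries the tool picks may differ from these quartiles.

```python
from bisect import bisect_right
from statistics import quantiles

def equal_frequency_cuts(values, buckets=4):
    """Return the buckets-1 cut points that split values into similar-sized groups."""
    return quantiles(values, n=buckets, method="inclusive")

def bucket_index(value, cuts):
    """Map a value to a bucket number between 0 and len(cuts)."""
    return bisect_right(cuts, value)

# Made-up ages, for illustration only (not taken from the census data).
ages = [19, 23, 25, 28, 31, 34, 37, 40, 44, 48, 53, 61]
cuts = equal_frequency_cuts(ages, buckets=4)

# Count how many ages fall into each of the four buckets.
counts = [0] * 4
for age in ages:
    counts[bucket_index(age, cuts)] += 1
print("cut points:", cuts)
print("cases per bucket:", counts)
```

With skewed attributes such as capital gains, where most values are identical, equal-frequency buckets can end up uneven, so check the bucket ranges in the model viewers.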
Also, for the Decision Tree, change the Algorithm Parameters. Set the Complexity
Penalty to 0.1. This will allow more attributes to be included in the decision tree.
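To see why a lower penalty lets more attributes into the tree, consider the toy rule below: a split is accepted only if its information gain exceeds the penalty times the entropy of the parent node. This rule is an assumption made purely for illustration; it is not the scoring formula used by the Decision Tree algorithm in the tool, and the rows are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="salary"):
    """Entropy of the parent node minus the weighted entropy of its children."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def accept_split(rows, attribute, penalty, target="salary"):
    """Toy rule (assumption): split only if gain > penalty * parent entropy."""
    parent_entropy = entropy([row[target] for row in rows])
    return information_gain(rows, attribute, target) > penalty * parent_entropy

# Made-up rows: education is only a moderately good predictor here.
rows = (
    [{"education": "Bachelors", "salary": ">50K"}] * 3
    + [{"education": "Bachelors", "salary": "<=50K"}]
    + [{"education": "HS-grad", "salary": "<=50K"}] * 3
    + [{"education": "HS-grad", "salary": ">50K"}]
)
print("split at penalty 0.9:", accept_split(rows, "education", 0.9))
print("split at penalty 0.1:", accept_split(rows, "education", 0.1))
```

Under this toy rule the same moderately useful attribute is rejected at a high penalty and accepted at 0.1, which matches the effect described above.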
Deploy and process these models.
Part II: Analyze Mining Models
Use the mining models to answer the following questions. As you answer these
questions, capture screen images and paste them into the document as justification for
your answers.
1) According to the Naïve Bayes algorithm, using the default algorithm parameter
settings, which input attribute has the strongest relationship with the salary prediction
output, according to the Dependency Network view?
2) Based on the Naïve Bayes algorithm, which value of this particular attribute is most
associated with having a >50K salary, according to the Attribute Profile view?
3) Based on the Naïve Bayes algorithm, what are the three most important factors
that support having a >50K salary, according to the Attribute Characteristics view?
4) Based on the Decision Tree algorithm, what are the two most important attributes for
predicting salary, according to its Dependency Network?
5) Based on the Decision Tree algorithm, viewing the first two levels of the decision tree
itself, what values for the two attributes are most likely to lead to a salary >50K? What
values will most likely lead to a salary <=50K?
6) Using the Clustering algorithm, look at the Cluster Profile view. Consider the cluster
that has the highest concentration of >50K salaries. In this cluster, what are the most
predominant characteristics?
7) Using the Clustering algorithm, look at the Cluster Discrimination view. For the
cluster that you found in the Cluster Profile view with the highest concentration of
>50K salaries, what are the three factors that most strongly favor this cluster over any
other?
8) Based on these algorithms, what would you say are the best ways to ensure that a
person would make over 50K salary?
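When reading the Naïve Bayes views for the questions above, it can help to remember that the model rests on simple per-class frequency counts: for each salary value, how often does each attribute value occur? The sketch below computes such a table for one attribute. The attribute name, its values, and the rows are made up, and the viewers in the tool may present or weight these figures differently.

```python
from collections import Counter, defaultdict

def attribute_profile(rows, attribute, target="salary"):
    """Return {target value: {attribute value: share of cases}} from raw counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[attribute]] += 1
    return {
        label: {value: n / sum(c.values()) for value, n in c.items()}
        for label, c in counts.items()
    }

# Made-up rows; the attribute name and values are hypothetical.
rows = [
    {"marital_status": "Married", "salary": ">50K"},
    {"marital_status": "Married", "salary": ">50K"},
    {"marital_status": "Married", "salary": ">50K"},
    {"marital_status": "Never-married", "salary": ">50K"},
    {"marital_status": "Never-married", "salary": "<=50K"},
    {"marital_status": "Never-married", "salary": "<=50K"},
    {"marital_status": "Married", "salary": "<=50K"},
    {"marital_status": "Divorced", "salary": "<=50K"},
]
profile = attribute_profile(rows, "marital_status")

# The value with the largest share among the >50K cases.
top_value = max(profile[">50K"], key=profile[">50K"].get)
print(top_value, profile[">50K"][top_value])
```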
Part III: Predicting Salaries Without Considering Gender, Marital Status,
Relationship, Nationality, or Age
At this point, you should have seen that the main factors in determining salary are items
like age, gender, and marital status. Now let's consider what happens when we ignore these
attributes. Change the usage of Sex, Marital Status, Relationship, Nationality, and Age to
Ignore, then answer the following questions (again, pasting screen images to support your
answers).
1) What are the three most important attributes for predicting salary in the Naïve Bayes
algorithm?
2) What are the values for these attributes that best determine that a person will make
>50K in the Naïve Bayes algorithm?
3) What are the three most important attributes for predicting salary in the Decision Tree
algorithm?
4) What are the values for these attributes that best determine that a person will make
>50K in the Decision Tree algorithm?
5) Using the Clustering algorithm, look at the Cluster Profile view. Consider the cluster
that has the highest concentration of >50K salaries. In this cluster, what characteristics
most favor the cluster?
Part IV: Performing Further Analysis
Use other mining models (such as Association Rules and/or Neural Networks), parameter
settings, choices of input attributes, etc. to perform more analysis of this data set. Come
up with a few of your own questions and answers, again using screen images.
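If you try Association Rules, the two standard measures for judging a rule are support (how often the whole rule holds across all cases) and confidence (how often the right-hand side holds when the left-hand side does). The sketch below computes both for one rule on made-up rows. The helper name is hypothetical, and the tool may report additional or differently named measures.

```python
def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    antecedent and consequent are dicts of attribute: value pairs.
    """
    def matches(row, items):
        return all(row.get(key) == value for key, value in items.items())

    n_antecedent = sum(1 for row in rows if matches(row, antecedent))
    n_both = sum(
        1 for row in rows if matches(row, antecedent) and matches(row, consequent)
    )
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up rows, for illustration only.
rows = (
    [{"education": "Bachelors", "salary": ">50K"}] * 3
    + [{"education": "Bachelors", "salary": "<=50K"}]
    + [{"education": "HS-grad", "salary": "<=50K"}] * 3
    + [{"education": "HS-grad", "salary": ">50K"}]
)
support, confidence = rule_stats(
    rows, {"education": "Bachelors"}, {"salary": ">50K"}
)
print("support:", support, "confidence:", confidence)
```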