Download CIS 498 Exercise for Salary Census Data (U.S. Only) Part I: Prepare

CIS 498 Exercise for Salary Census Data (U.S. Only)

Part I: Prepare the data source, mining structure, and mining models

Use BI Development Studio for this exercise. Using the Salary Census Training data, start by creating a Data Source View based on a named query that selects only people whose native country is the United States. Then create a mining structure with the following mining models: Naïve Bayes, Decision Tree, and Clustering. Use all attributes (other than the key) as input, except for the Salary attribute, which should be Predict Only in all models.

Note: some models will ignore some of the input attributes. You should notice that several attributes are ignored in the default models. This is because they are continuous, and the models do not handle continuous variables. So change the Age, Capital Gains, Capital Loss, and Hours per Week attributes to Discretized, set their bucket counts to 4, and set the discretization method to "equal areas". After you do this, go into each mining model and change the usage of these four attributes from "Ignore" to "Input".

Also, for the Decision Tree, change the Algorithm Parameters: set the Complexity Penalty to 0.1. This will allow more attributes to be included in the decision tree. Deploy and process these models.

Part II: Analyze Mining Models

Use the mining models to answer the following questions. As you answer these questions, capture screen images and paste them into the document as justification for your answers.

1) According to the Dependency Network view of the Naïve Bayes model, using the default algorithm parameter settings, which input attribute has the strongest relationship with the salary prediction output?
2) Based on the Naïve Bayes algorithm, which value of this attribute is most associated with having a >50K salary, according to the Attribute Profile view?
3) Based on the Naïve Bayes algorithm, what are the three most important factors that support having a >50K salary, according to the Attribute Characteristics view?
4) Based on the Decision Tree algorithm, what are the two most important attributes for predicting salary, according to its Dependency Network?
5) Based on the Decision Tree algorithm, viewing the first two levels of the decision tree itself, what values for the two attributes are most likely to lead to a salary >50K? What values will most likely lead to a salary <=50K?
6) Using the Clustering algorithm, look at the Cluster Profile view. Consider the cluster that has the highest concentration of >50K salaries. In this cluster, what are the most predominant characteristics?
7) Using the Clustering algorithm, look at the Cluster Discrimination view. For the cluster with the highest concentration of >50K salaries that you found in the Cluster Profile view, what are the three factors that most strongly favor this cluster over any other?
8) Based on these algorithms, what would you say are the best ways to ensure that a person would make over 50K salary?

Part III: Predicting Salaries Without Considering Gender, Marital Status, Relationship, Nationality, or Age

At this point, you should have seen that the main factors in determining salary are items like age, gender, and marital status. Now let's consider what happens when we ignore these attributes. Change the usage of Sex, Marital Status, Relationship, Nationality, and Age to Ignore, then answer the following questions (again, pasting screen images to support your answers).

1) What are the three most important attributes for predicting salary in the Naïve Bayes algorithm?
2) What are the values for these attributes that best determine that a person will make >50K in the Naïve Bayes algorithm?
3) What are the three most important attributes for predicting salary in the Decision Tree algorithm?
4) What are the values for these attributes that best determine that a person will make >50K in the Decision Tree algorithm?
5) Using the Clustering algorithm, look at the Cluster Profile view. Consider the cluster that has the highest concentration of >50K salaries. In this cluster, which characteristics most favor the cluster?

Part IV: Performing Further Analysis

Use other mining models (such as Association Rules and/or Neural Networks), parameter settings, choices of input attributes, etc. to perform more analysis of this data set. Come up with a few of your own questions and answers, again using screen images.
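
As background for the discretization step in Part I, the following Python sketch illustrates the general idea behind "equal areas" bucketing with a bucket count of 4: boundaries are placed so that each bucket holds roughly the same number of cases. It is only an approximation built on quartiles; the tool may choose its boundaries differently. The function names and the sample ages are hypothetical and are not taken from the Salary Census data.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_area_cut_points(values, buckets=4):
    """Return the (buckets - 1) boundaries that split the values into
    groups of roughly equal size. This is a quantile-based approximation,
    not the exact procedure used by the mining tool."""
    return quantiles(values, n=buckets, method="inclusive")


def assign_bucket(value, cut_points):
    """Return the 0-based index of the bucket that the value falls into."""
    return bisect_right(cut_points, value)


# Hypothetical ages, made up for illustration only.
ages = [19, 23, 25, 28, 31, 34, 37, 40, 44, 49, 55, 62]

cuts = equal_area_cut_points(ages, buckets=4)
bucket_of_each_age = [assign_bucket(age, cuts) for age in ages]
counts = [bucket_of_each_age.count(i) for i in range(4)]

print("Cut points:", cuts)
print("Cases per bucket:", counts)
```

With evenly sized buckets like these, no single bucket dominates the training cases, which is one reason to prefer this method over equal-width ranges for skewed attributes such as capital gains.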
