Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
資料探勘與知識發現~期末報告 Using Adult DB to Predict Yearly Salary Greater or Less than 50K in 1994 Adviser : Prof. San-Yih Hwang Student : Hung, I-Chun (b924020024) Lee, Nan-Kuang (b924020031) Tseng, Shih Hui (b924020036) 資料探勘與 知識發現 Using Adult DB to Predict Yearly Salary Greater or Less than 50K in 1994 Adviser: Prof. San-Yih Hwang Student: Hung, I-Chun (b924020024) Lee, Nan-Kuang (b924020031) Tseng, Shih Hui (b924020036) VÉÇàxÇà A.Clarify the motivation and the background .......................................2 B.Step 1:Translate business problem into DM problem.....................2 C.Step 2:Select appropriate data ..........................................................2 D.STEP 3:Get to know the data............................................................3 E.Step 4:Create a Model Set................................................................10 F.Step 5:Fix Problems with the Data ..................................................10 G.Step 6:Transform Data to Bring Information to the Surface.......11 H.Step 7:Build Models .........................................................................12 I.Step 8:Access Models.........................................................................13 J.Conclusion ............................................................................................14 1 A.Clarify the motivation and the background Motivation Because we are the graduate student, most of us ate going to work. For understanding how to prepare to earn higher salary, we found this data, and try to find the model to learn. Background Data Extraction was done by Barry Becker from the 1994 Census database (http://www.census.gov/). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)) B.Step 1:Translate business problem into DM problem Identify the people whose salaries are greater or less than 50K. In the step one, we do the translation to make original problem turn into a concrete goal. So, we use “classification” data mining techniques to do the direct data mining. This process of building a classifier starts with examples of records that have already correctly classified. And in the result, we expect our model can reach higher correct prediction rate (lower error rate). C.Step 2:Select appropriate data Indication of what attributes were being predicted Salary greater or less than 50K. Data Extraction was done by Barry Becker from the 1994 Census database (http://www.census.gov/). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). At the same time, we set “greater or less than 50K” about salary as the class label. Attributes Age continuous Work class Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked Fnlwgt continuous 2 Attributes Education Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool Education-num continuous Marital-status Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse Occupation Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces Relationship Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried Race White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black Sex Female, Male Capital-gain continuous Capital-loss continuous Hours-per-week continuous Native-country United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands Class Label Class >50K, <=50K D.STEP 3:Get to know the data This data set is reasonable and able to use. For example, We draw a histogram (Figure 1) contains number of people and age as follows. It shows up many people are 18-50 years old and this range represents a normal working age. According to its distribution, we think it’s reasonable. The followings are distributions or relations figures based on each attribute and people amount. 3 1000 900 800 700 ˠᇴ 600 500 400 300 200 100 0 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 年齡 Figure 1: It shows the age distribution. We found most people fall on the 16-60 interval. In the reality, this situation is reasonable. 25000 20000 15000 ˠᇴ 21790 10000 10771 5000 0 Male Female ّҾ Figure 2: It shows the Male and Female ratio is about 2:1. It’s reasonable. In the reality, male of working percentage is higher than female. 4 90 95 100 12000 10501 10000 7291 8000 ˠᇴ 5355 6000 4000 2000 646 514 51 168 333 1175 933 1382 1067 433 1723 576 413 0 te ra to oc D rs te as M s or el ch ge Ba lle co em So ad gr Sl H oo ch -s of Pr dm ac cso As c vo cso As th 12 th 10 th 11 h 9t h 8t h7t h 6t h5t th t-4 1s l oo ch es Pr ጯ። Figure 3: In the data we got, most of people have HS-grad or better than it. 16000 14976 14000 10683 12000 10000 ˠᇴ 8000 6000 4443 4000 1025 2000 993 418 23 0 d ed w te sp drie ar o id ra pa se -a e t en bs se us po -s ou u po d rie ar -m d AF drie ar M M W Se er ev N ce or iv D s vci drie ar M ېބၗ Figure 4: The most part is the people who married. The second part is the people who are single. 5 16000 14000 12000 ˠᇴ 10000 8000 6000 4000 2000 0 0 10 20 30 40 50 60 70 80 90 100 Տฉ̍үॡᇴ Figure 5: Most part of this data shows a common situation 40hrs per week. It accords with the reality. 25000 22696 20000 15000 ˠᇴ 10000 5000 2541 2093 1116 960 1298 14 0 N ev d ke or -w er ay -p 6 t ou Figure 6: The type of “Private” work has high percentage. ith W v go e- ov -g al at St c Lo ov l-g ra de Fe c in pem lfSe nc t-i no pem lfSe e at iv Pr ̍үᙷݭ 7 9000 8201 8000 7000 6302 6000 ˠᇴ 5235 5000 4000 3398 3000 2741 2272 2005 2000 1143 1000 621 2 1 2 1 1 0 1 0 2 <1 45 0K 3 <1 35 0K 3 <1 25 0K 4 <1 15 0K 4 <1 05 0K 5 <9 50 K 9 <8 50 K <7 50 K <6 50 K <4 50 K <3 50 K <2 50 K <1 50 K <5 0K 0 <5 50 K 271 148 82 42 36 26 Gomxhu Figure 7: Fnlwgt and people amount. Almost all people are fall on 0~450K interval. 4500 4000 3500 3000 ˠᇴ 2500 2000 1500 1000 500 0 f ra u -s ch s ce or -F rv ed m se Ar etiv v ec er ot -s Pr se g ou in -h ov iv t-m Pr or sp ng hi an fi s Tr gin l rm a Fa ric ct le sp -c -in m p Ad -o rs ne ne hi ea ac cl M ser dl y t an al H ci l pe ia -s er of ag Pr an -m ec Ex s e le ic Sa erv -s er th r O ai ep t-r t or pp C Te ᖚຽ Figure 8: Occupation and people amount. The data we got contains 14 occupations. 7 14000 12000 10000 8000 ˠᇴ 6000 4000 2000 0 Wife Own-child Husband Not-in-family Other-relative Unmarried ᙯܼ Figure 9: Family character and people amount. As the figure2 shows, male’s working percentage is higher. So, in this figure “Husband” label has the most amounts. 30000 25000 20000 ˠᇴ 15000 10000 5000 0 White Asian-Pac-Islander Amer-IndianEskimo Black Other Figure 10: In this data we got, the white people are a large part. 8 Ϡ г ˠ ᇴ ͧ ּ United-States Cambodia England Puerto-Rico Canada Germany Outlying-US(Guam-USVI-etc) India Japan Greece South China Cuba Iran Honduras Philippines Italy Poland Jamaica Vietnam Mexico Portugal Ireland France Dominican-Republic Laos Ecuador Taiwan Haiti Columbia Hungary Guatemala Nicaragua Scotland Thailand Yugoslavia El-Salvador Trinadad&Tobago Peru, Hong Holand-Netherlands Figure 11: As the Figure10 shows, the most people’s native country is US. 30000 29849 25000 20000 ˠᇴ 15000 10000 5000 1942 517 0 0K 10K 20K 87 30K 5 2 0 0 0 0 159 40K 50K 60K 70K 80K 90K 100K dbqjubm.hbjo Figure 12: Most people’s capital-gain is 0. It represent most people don’t have extra profit by selling assets. 9 35000 31042 30000 25000 20000 ˠᇴ 15000 10000 5000 0 5 0K 6 0.5K 15 10 1K 9 91 1.5K 374 684 137 2K 144 21 2.5K 12 3K 0 2 3.5K 2 4 4K 0 3 4.5K dbqjubm.mptt Figure 13: Most people’s capital-loss is 0. It represent most people don’t have extra loss by selling assets. E.Step 4:Create a Model Set 1. 2. 3. The data contains many personal characteristics, and most studies think there are possible relationships between personal salary and these characteristics. The quantity of data is large, so we suppose the data is balanced.(*The quality is 32500) Since this data set is from 1994, we use the data from the same year to be the test data. We will try to find the other year data to be the future research. F.Step 5:Fix Problems with the Data Categorical variables with too many values z The education variable is transferred to education-num variable. Since the 12-year compulsory education in American, we suppose that people with unfinished compulsory education are most earning lower salary(No conspicuous difference.) We grouped the classes from Preschool to 12th. 10 z And since the same meaning of education and education-num, we decide delete the variable education-num to avoid double use. 12th 11th 10th 9th 7th-8th 5th-6th 1st-4th Preschool Unfinished Numeric values with skewed distribution and outliers We try to find any outliers in these attributes. We use the mean and standard deviation to calculate every numeric attributes to find outliers. Then we find the age exist the outliers. z Age (mean: 38.4, standard deviation:13.37) 38.4+3*13.37 = 78.51 38.4-3*13.37 = -1.71 We remove the age more than 80 years old. Missing value We replace the missing value with the average value of every attributes. z Categorical Variables:the attribute that appear most times z Numeric Variables: the average value Since the quantity of the data is large, the problems won’t cause many trouble. G.Step 6:Transform Data to Bring Information to the Surface Capture Trends Since the data is just the collection of one year, and our research time series is year. We didn’t do this step. Create Rations and Other combination of Variables We didn’t find relative rations about this mining research, so we didn’t do the step too. Convert Counts to Proportions The data don’t contain counts, so we ignore this step. 11 H.Step 7:Build Models We used RapidMiner to help us to mine the data, and the version is 4.0 beta. Figure 14 is what the software look like. Figure 14 We used the CSV file which we edited as the source. After the software filtrated out missing values, we had the decision tree like Figure 15. Figure 15 12 As picture, this decision tree is very unbalance and left-leaning. Figure 16 is part of the decision tree which near root part. Figure 16 We can see that the most conspicuous indicator is capital-gain, while capital-gain > 7073.5 and age > 21, the record will be labeled by “> 50K”. And if age < 21, the record will be labeled by “<= 50K”. On the other side, to label whether “> 50K” or “<= 50K” is not that easy though. ※ (3/0) means there are 3 records were labeled by <= 50K and it is true. And (6/579) means there are 579 records were labeled by > 50K and it is true but 6 are false. I.Step 8:Access Models We used part of our source data as the test data. And the result is follow. True <= 50K True > 50K class precision Pred. <= 50K 16991 861 95.18% Pred. > 50K 7624 6964 47.74% class recall 69.03% 89.00% 24615 7825 SUM class precision = 16991 / (16991+861) = 95.18% = 6964 / (7624+6964) = 47.74% class recall = 16991 / (16991+7624) = 69.03% = 6964 / (861+6964) = 89.00% 32440 Accuracy = (16991+6964) / (16991+6964 + 861+7624) = 73.84% Lift = 191.97% According the table, the accuracy of our decision tree is 73.84%. And it is good at prediction the record which is true <= 50K. Also, the Lift value is 191.97%, it means the more the better! 13 J.Conclusion 1. Comparing the mining result and our pre-assumption. Before mining, we assume the most conspicuous indicator is education, since the direct thinking. But the mining result is capital-gain, so we think that people make more money who can use more money to invest. 2. In the mining process, we have learned that it’s better to understand the meaning of all attributes, or the process will block very often. 3. Fix data problem is a very important step. After fixing data problem, the mining result will be more accuracy. 14