Download Using Adult DB to Predict Yearly Salary Greater or Less than 50K in

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Transcript
資料探勘與知識發現~期末報告
Using Adult DB to
Predict Yearly Salary
Greater or Less than 50K
in 1994
Adviser : Prof. San-Yih Hwang
Student : Hung, I-Chun (b924020024)
Lee, Nan-Kuang (b924020031)
Tseng, Shih Hui (b924020036)
資料探勘與
知識發現
Using Adult DB to Predict Yearly Salary
Greater or Less than 50K in 1994
Adviser: Prof. San-Yih Hwang
Student: Hung, I-Chun (b924020024)
Lee, Nan-Kuang (b924020031)
Tseng, Shih Hui (b924020036)
VÉÇàxÇà
A.Clarify the motivation and the background .......................................2
B.Step 1:Translate business problem into DM problem.....................2
C.Step 2:Select appropriate data ..........................................................2
D.STEP 3:Get to know the data............................................................3
E.Step 4:Create a Model Set................................................................10
F.Step 5:Fix Problems with the Data ..................................................10
G.Step 6:Transform Data to Bring Information to the Surface.......11
H.Step 7:Build Models .........................................................................12
I.Step 8:Access Models.........................................................................13
J.Conclusion ............................................................................................14
1
A.Clarify the motivation and the background
‹
Motivation
Because we are the graduate student, most of us ate going to work. For
understanding how to prepare to earn higher salary, we found this data, and try to find
the model to learn.
‹
Background
Data Extraction was done by Barry Becker from the 1994 Census database
(http://www.census.gov/). A set of reasonably clean records was extracted using the
following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) &&
(HRSWK>0))
B.Step 1:Translate business problem into DM problem
‹
Identify the people whose salaries are greater or less than 50K.
In the step one, we do the translation to make original problem turn into a
concrete goal. So, we use “classification” data mining techniques to do the direct data
mining. This process of building a classifier starts with examples of records that have
already correctly classified. And in the result, we expect our model can reach higher
correct prediction rate (lower error rate).
C.Step 2:Select appropriate data
‹
Indication of what attributes were being predicted Salary greater or less
than 50K.
Data Extraction was done by Barry Becker from the 1994 Census database
(http://www.census.gov/). A set of reasonably clean records was extracted using the
following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) &&
(HRSWK>0)). At the same time, we set “greater or less than 50K” about salary as the
class label.
Attributes
Age
continuous
Work class
Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov,
State-gov, Without-pay, Never-worked
Fnlwgt
continuous
2
Attributes
Education
Bachelors, Some-college, 11th, HS-grad, Prof-school,
Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th,
Doctorate, 5th-6th, Preschool
Education-num
continuous
Marital-status
Married-civ-spouse, Divorced, Never-married, Separated,
Widowed, Married-spouse-absent, Married-AF-spouse
Occupation
Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspct,
Adm-clerical, Farming-fishing, Transport-moving,
Priv-house-serv, Protective-serv, Armed-Forces
Relationship
Wife, Own-child, Husband, Not-in-family, Other-relative,
Unmarried
Race
White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex
Female, Male
Capital-gain
continuous
Capital-loss
continuous
Hours-per-week
continuous
Native-country
United-States, Cambodia, England, Puerto-Rico, Canada,
Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece,
South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland,
Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia,
El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
Class Label
Class
>50K, <=50K
D.STEP 3:Get to know the data
‹
This data set is reasonable and able to use.
For example, We draw a histogram (Figure 1) contains number of people and age
as follows. It shows up many people are 18-50 years old and this range represents a
normal working age. According to its distribution, we think it’s reasonable. The
followings are distributions or relations figures based on each attribute and people
amount.
3
1000
900
800
700
ˠᇴ
600
500
400
300
200
100
0
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
年齡
Figure 1: It shows the age distribution. We found most people fall on the 16-60 interval. In the
reality, this situation is reasonable.
25000
20000
15000
ˠᇴ
21790
10000
10771
5000
0
Male
Female
ّҾ
Figure 2: It shows the Male and Female ratio is about 2:1. It’s reasonable. In the reality, male of
working percentage is higher than female.
4
90
95
100
12000
10501
10000
7291
8000
ˠᇴ
5355
6000
4000
2000
646 514
51 168 333
1175 933
1382
1067
433
1723
576
413
0
te
ra
to
oc
D
rs
te
as
M
s
or
el
ch
ge
Ba
lle
co
em
So
ad
gr
Sl
H
oo
ch
-s
of
Pr
dm
ac
cso
As
c
vo
cso
As
th
12
th
10
th
11
h
9t
h
8t
h7t
h
6t
h5t
th
t-4
1s
l
oo
ch
es
Pr
ጯ።
Figure 3: In the data we got, most of people have HS-grad or better than it.
16000
14976
14000
10683
12000
10000
ˠᇴ
8000
6000
4443
4000
1025
2000
993
418
23
0
d
ed
w
te
sp
drie
ar
o
id
ra
pa
se
-a
e
t
en
bs
se
us
po
-s
ou
u
po
d
rie
ar
-m
d
AF
drie
ar
M
M
W
Se
er
ev
N
ce
or
iv
D
s
vci
drie
ar
M
૓‫ېބ‬ၗ
Figure 4: The most part is the people who married. The second part is the people who are single.
5
16000
14000
12000
ˠᇴ
10000
8000
6000
4000
2000
0
0
10
20
30
40
50
60
70
80
90
100
Տฉ̍үॡᇴ
Figure 5: Most part of this data shows a common situation 40hrs per week. It accords with the
reality.
25000
22696
20000
15000
ˠᇴ
10000
5000
2541
2093
1116
960
1298
14
0
N
ev
d
ke
or
-w
er
ay
-p
6
t
ou
Figure 6: The type of “Private” work has high percentage.
ith
W
v
go
e-
ov
-g
al
at
St
c
Lo
ov
l-g
ra
de
Fe
c
in
pem
lfSe
nc
t-i
no
pem
lfSe
e
at
iv
Pr
̍үᙷ‫ݭ‬
7
9000
8201
8000
7000
6302
6000
ˠᇴ
5235
5000
4000
3398
3000
2741
2272
2005
2000
1143
1000
621
2
1
2
1
1
0
1
0
2
<1
45
0K
3
<1
35
0K
3
<1
25
0K
4
<1
15
0K
4
<1
05
0K
5
<9
50
K
9
<8
50
K
<7
50
K
<6
50
K
<4
50
K
<3
50
K
<2
50
K
<1
50
K
<5
0K
0
<5
50
K
271 148
82 42 36 26
Gomxhu
Figure 7: Fnlwgt and people amount. Almost all people are fall on 0~450K interval.
4500
4000
3500
3000
ˠᇴ
2500
2000
1500
1000
500
0
f
ra
u
-s
ch
s
ce
or
-F
rv
ed
m
se
Ar
etiv
v
ec
er
ot
-s
Pr
se
g
ou
in
-h
ov
iv
t-m
Pr
or
sp
ng
hi
an
fi s
Tr
gin
l
rm
a
Fa
ric
ct
le
sp
-c
-in
m
p
Ad
-o
rs
ne
ne
hi
ea
ac
cl
M
ser
dl
y
t
an
al
H
ci
l
pe
ia
-s
er
of
ag
Pr
an
-m
ec
Ex
s
e
le
ic
Sa erv
-s
er
th
r
O
ai
ep
t-r
t
or
pp
C
Te
ᖚຽ
Figure 8: Occupation and people amount. The data we got contains 14 occupations.
7
14000
12000
10000
8000
ˠᇴ
6000
4000
2000
0
Wife
Own-child
Husband
Not-in-family
Other-relative
Unmarried
ᙯܼ
Figure 9: Family character and people amount. As the figure2 shows, male’s working percentage
is higher. So, in this figure “Husband” label has the most amounts.
30000
25000
20000
ˠᇴ
15000
10000
5000
0
White
Asian-Pac-Islander
Amer-IndianEskimo
Black
Other
჌୉
Figure 10: In this data we got, the white people are a large part.
8
΍
Ϡ
г
ˠ
ᇴ
ͧ
ּ
United-States
Cambodia
England
Puerto-Rico
Canada
Germany
Outlying-US(Guam-USVI-etc)
India
Japan
Greece
South
China
Cuba
Iran
Honduras
Philippines
Italy
Poland
Jamaica
Vietnam
Mexico
Portugal
Ireland
France
Dominican-Republic
Laos
Ecuador
Taiwan
Haiti
Columbia
Hungary
Guatemala
Nicaragua
Scotland
Thailand
Yugoslavia
El-Salvador
Trinadad&Tobago
Peru, Hong
Holand-Netherlands
Figure 11: As the Figure10 shows, the most people’s native country is US.
30000
29849
25000
20000
ˠᇴ
15000
10000
5000
1942
517
0
0K
10K
20K
87
30K
5
2
0
0
0
0
159
40K
50K
60K
70K
80K
90K
100K
dbqjubm.hbjo
Figure 12: Most people’s capital-gain is 0. It represent most people don’t have extra profit by
selling assets.
9
35000
31042
30000
25000
20000
ˠᇴ
15000
10000
5000
0
5
0K
6
0.5K
15
10
1K
9
91
1.5K
374
684 137
2K
144
21
2.5K
12
3K
0
2
3.5K
2
4
4K
0
3
4.5K
dbqjubm.mptt
Figure 13: Most people’s capital-loss is 0. It represent most people don’t have extra loss by selling
assets.
E.Step 4:Create a Model Set
1.
2.
3.
The data contains many personal characteristics, and most studies think there are
possible relationships between personal salary and these characteristics.
The quantity of data is large, so we suppose the data is balanced.(*The quality is
32500)
Since this data set is from 1994, we use the data from the same year to be the test
data. We will try to find the other year data to be the future research.
F.Step 5:Fix Problems with the Data
‹
Categorical variables with too many values
z The education variable is transferred to education-num variable. Since the
12-year compulsory education in American, we suppose that people with
unfinished compulsory education are most earning lower salary(No
conspicuous difference.) We grouped the classes from Preschool to 12th.
10
z
And since the same meaning of education and education-num, we decide
delete the variable education-num to avoid double use.
12th
11th
10th
9th
7th-8th
5th-6th
1st-4th
Preschool
‹
Unfinished
Numeric values with skewed distribution and outliers
We try to find any outliers in these attributes. We use the mean and standard
deviation to calculate every numeric attributes to find outliers. Then we find the
age exist the outliers.
z
Age (mean: 38.4, standard deviation:13.37)
38.4+3*13.37 = 78.51
38.4-3*13.37 = -1.71
We remove the age more than 80 years old.
‹
Missing value
We replace the missing value with the average value of every attributes.
z Categorical Variables:the attribute that appear most times
z Numeric Variables: the average value
‹
Since the quantity of the data is large, the problems won’t cause many trouble.
G.Step 6:Transform Data to Bring Information to the Surface
‹
Capture Trends
Since the data is just the collection of one year, and our research time series is
year. We didn’t do this step.
‹
Create Rations and Other combination of Variables
We didn’t find relative rations about this mining research, so we didn’t do the
step too.
‹
Convert Counts to Proportions
The data don’t contain counts, so we ignore this step.
11
H.Step 7:Build Models
We used RapidMiner to help us to mine the data, and the version is 4.0 beta. Figure 14
is what the software look like.
Figure 14
We used the CSV file which we edited as the source. After the software filtrated out
missing values, we had the decision tree like Figure 15.
Figure 15
12
As picture, this decision tree is very unbalance and left-leaning. Figure 16 is part of the
decision tree which near root part.
Figure 16
We can see that the most conspicuous indicator is capital-gain, while capital-gain >
7073.5 and age > 21, the record will be labeled by “> 50K”. And if age < 21, the
record will be labeled by “<= 50K”. On the other side, to label whether “> 50K” or
“<= 50K” is not that easy though.
※ (3/0) means there are 3 records were labeled by <= 50K and it is true. And (6/579)
means there are 579 records were labeled by > 50K and it is true but 6 are false.
I.Step 8:Access Models
We used part of our source data as the test data. And the result is follow.
True <= 50K
True > 50K
class precision
Pred. <= 50K
16991
861
95.18%
Pred. > 50K
7624
6964
47.74%
class recall
69.03%
89.00%
24615
7825
SUM
class precision
= 16991 / (16991+861) = 95.18%
= 6964 / (7624+6964) = 47.74%
class recall
= 16991 / (16991+7624) = 69.03%
= 6964 / (861+6964) = 89.00%
32440
Accuracy = (16991+6964) / (16991+6964 + 861+7624) = 73.84%
Lift = 191.97%
According the table, the accuracy of our decision tree is 73.84%. And it is good at
prediction the record which is true <= 50K. Also, the Lift value is 191.97%, it means
the more the better!
13
J.Conclusion
1.
Comparing the mining result and our pre-assumption.
‹ Before mining, we assume the most conspicuous indicator is education,
since the direct thinking.
‹ But the mining result is capital-gain, so we think that people make more
money who can use more money to invest.
2.
In the mining process, we have learned that it’s better to understand the meaning
of all attributes, or the process will block very often.
3.
Fix data problem is a very important step. After fixing data problem, the mining
result will be more accuracy.
14