Individuality of Handwriting: A Validation Study

Data Mining on NIJ data
Sangjik Lee
Unstructured Data Mining
[Diagram: Text → Keyword Extraction → Structured Database → Data Mining; Image → Feature Extraction → Structured Database → Data Mining]
Handwritten CEDAR Letter
Document Level Features
Measures of pen pressure, writing movement, and stroke formation, plus slant and word proportion:
1. Entropy
2. Gray-level threshold
3. Number of black pixels
4. Stroke width
5. Number of interior contours
6. Number of exterior contours
7. Number of vertical slope components
8. Number of horizontal slope components
9. Number of negative slope components
10. Number of positive slope components
11. Slant
12. Height
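As a concrete illustration of the first three document-level measures, here is a minimal Python sketch. It assumes the page image is a 2-D numpy array of gray levels; the slides do not spell out the exact definitions (e.g. how the gray-level threshold is chosen), so the mean-based threshold below is an assumption.

```python
import numpy as np

def document_level_features(gray):
    """Sketch of features 1-3: entropy, gray-level threshold, black pixel count."""
    # 1. Entropy of the gray-level histogram
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                                # drop empty bins before the log
    entropy = -np.sum(p * np.log2(p))

    # 2. Gray-level threshold (mean gray level as an illustrative stand-in)
    threshold = gray.mean()

    # 3. Number of black pixels under the threshold
    num_black = int((gray < threshold).sum())

    return entropy, threshold, num_black
```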
Character Level Features
Gradient direction at pixel (i,j): θ(i,j) = tan⁻¹( Sy(i,j) / Sx(i,j) )

Each character image is divided into a 4x4 grid; every feature is computed per grid cell (x,y), x,y = 0..3.

Gradient (directional) features (12 direction bins x 16 cells = 192 bits):
ID       Direction
G01-xy   1° ~ 30°
G02-xy   31° ~ 60°
G03-xy   61° ~ 90°
G04-xy   91° ~ 120°
G05-xy   121° ~ 150°
G06-xy   151° ~ 180°
G07-xy   181° ~ 210°
G08-xy   211° ~ 240°
G09-xy   241° ~ 270°
G10-xy   271° ~ 300°
G11-xy   301° ~ 330°
G12-xy   331° ~ 360°
(IDs run from G01-00 at grid position (0,0) to G12-33 at (3,3).)

Structural features (12 rules x 16 cells = 192 bits):
ID       Rule
S01-xy   r1
S02-xy   r2
S03-xy   r3
S04-xy   r4
S05-xy   r5
S06-xy   r6
S07-xy   r7
S08-xy   r8
S09-xy   r9
S10-xy   r10
S11-xy   r11
S12-xy   r12
(IDs run from S01-00 to S12-33.)

Concavity features (8 types x 16 cells = 128 bits):
ID       Concavity
C-CP-xy  Coarse pixel density
C-HR-xy  Horizontal run length
C-VR-xy  Vertical run length
C-UC-xy  Upward concavity
C-DC-xy  Downward concavity
C-LC-xy  Left concavity
C-RC-xy  Right concavity
C-HC-xy  Hole concavity
(IDs run from C-CP-00 to C-HC-33.)
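The gradient features above reduce to a short computation: estimate Sx and Sy, take θ = tan⁻¹(Sy/Sx), and set one bit per 30° direction bin per grid cell. The sketch below illustrates that scheme using central-difference derivatives in place of the study's actual operators; it is not the CEDAR implementation.

```python
import numpy as np

def gradient_feature_bits(img, grid=4, nbins=12):
    """4x4 grid x 12 direction bins = 192 gradient feature bits."""
    img = img.astype(float)
    sy = np.gradient(img, axis=0)                    # stand-in for Sy(i,j)
    sx = np.gradient(img, axis=1)                    # stand-in for Sx(i,j)
    theta = np.degrees(np.arctan2(sy, sx)) % 360.0   # tan^-1(Sy/Sx), in [0, 360)

    h, w = img.shape
    bits = np.zeros((grid, grid, nbins), dtype=np.uint8)
    ys, xs = np.nonzero(sx**2 + sy**2 > 0)           # pixels with a gradient
    for y, x in zip(ys, xs):
        gy = min(y * grid // h, grid - 1)            # grid cell row (0..3)
        gx = min(x * grid // w, grid - 1)            # grid cell column (0..3)
        b = int(theta[y, x] // (360 // nbins))       # 30-degree bin (0..11)
        bits[gy, gx, b] = 1                          # feature G(b+1)-(gy)(gx)
    return bits.reshape(-1)                          # 192-bit vector
```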
Character Level Features (example binary vectors for one character)
Gradient (192 bits):
  000000000011000000001100001110000000111000000011000000
  11000100000000110000000000000111001100011111000011110000000010
  01010000010001110011111001111100000100000100000000000000000000
  01000001001000
Structure (192 bits):
  000000000000000000001100001110001000010000100000010000
  000000000100101000000000011000010100110000110000000000000100100
  011001100000000000000110010100000000000001100000000000000000000
  000000010000
Concavity (128 bits):
  11110110100111110110011000000110111101101001100100000
  110000011100000000000000000000000000000000000000000111111100000
  000000000000
Writer and Feature Data
Writer data is binary-coded per category; each writer contributes three samples of normalized feature data.

Writer data legend: Gen (M F) | Age (<14 <24 <44 <64 <84 >85) | Han (L R) | Edu (H C) | Ethn (H W B A O) | Sch (U F)
Feature data (normalized): dark blob hole slant width skew ht (types: int int int real int real int)

Writer 1: 0 1 | 0 0 1 0 0 0 | 0 1 | 0 1 | 0 0 0 1 0 | 0 1
  samples: .95 .49 .70 .71 .50 .10 .30
           .94 .49 .75 .70 .50 .11 .30
           .94 .49 .67 .74 .50 .10 .30
Writer 2: 1 0 | 0 0 1 0 0 0 | 0 1 | 0 1 | 0 0 0 1 0 | 1 0
  samples: .93 .72 .33 .47 .50 .21 .28
           .93 .74 .33 .48 .50 .22 .26
           .93 .79 .36 .54 .50 .18 .27
Writer 3: 1 0 | 0 0 1 0 0 0 | 0 1 | 0 1 | 0 0 0 1 0 | 0 1
  samples: .92 .30 .61 .66 .60 .11 .35
           .94 .42 .72 .66 .60 .11 .32
           .94 .40 .75 .67 .60 .12 .34
Writer 4: 1 0 | 0 0 0 1 0 0 | 0 1 | 0 1 | 0 0 0 1 0 | 0 1
  samples: .96 .30 .60 .59 .50 .10 .21
           .95 .32 .60 .59 .50 .09 .22
           .95 .30 .66 .60 .50 .10 .21
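The binary writer rows are one-hot codes over the six categorical attributes. The sketch below reproduces that coding; the attribute dictionaries mirror the column legend above, though the expansions of codes such as H/C or U/F are not given in the slides.

```python
# One-hot coding of the writer attributes, matching the row layout above.
CATEGORIES = {
    "Gen": ["M", "F"],
    "Age": ["<14", "<24", "<44", "<64", "<84", ">85"],
    "Han": ["L", "R"],
    "Edu": ["H", "C"],
    "Ethn": ["H", "W", "B", "A", "O"],
    "Sch": ["U", "F"],
}

def encode_writer(writer):
    """Flatten a writer record into the 19-bit vector used in the table above."""
    bits = []
    for attr, values in CATEGORIES.items():
        bits.extend(1 if writer[attr] == v else 0 for v in values)
    return bits

# Example record; this reproduces Writer 1's row: 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1
print(encode_writer({"Gen": "F", "Age": "<44", "Han": "R",
                     "Edu": "C", "Ethn": "A", "Sch": "F"}))
```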
Instances of the Data (normalized)
Document-level feature data (12 features):

entropy dark pixel blob hole hslope nslope pslope vslope slant width ht
real    int  int   int  int  int    int    int    int    real  int   int
.95     .49  .70   .71  .50  .10    .51    .92    .13    .47   .32   .21
.94     .49  .75   .70  .50  .11    .53    .84    .26    .54   .35   .18
.94     .49  .67   .74  .50  .10    .45    .85    .23    .48   .32   .22
.93     .72  .33   .47  .50  .21    .28    .30    .66    .60   .42   .10
.93     .74  .33   .48  .50  .22    .26    .30    .60    .59   .45   .10
.93     .79  .36   .54  .50  .18    .27    .32    .60    .59   .52   .09
.92     .30  .61   .66  .60  .11    .35    .49    .70    .71   .57   .10
.94     .42  .72   .66  .60  .11    .32    .49    .67    .74   .53   .10
.94     .40  .75   .67  .60  .12    .34    .49    .75    .70   .54   .11
.96     .30  .60   .59  .50  .10    .21    .30    .66    .60   .36   .10
.95     .32  .60   .59  .50  .09    .22    .30    .60    .59   .39   .10
.95     .30  .66   .60  .50  .10    .21    .32    .60    .59   .34   .09
Data Mining on sub-groups
[Feature plots for four sub-groups: white female, white male, black female, black male]
Data Mining on sub-groups (Cont.)
Sub-group comparisons are a useful source of patterns to be mined.
• 1-constraint sub-groups: {Male : Female}, {White : Black : Hispanic}, etc.
• 2-constraint sub-groups: {Male-White : Female-White}, etc.
• 3-constraint sub-groups: {Male-White-25~45 : Female-White-25~45}, etc.
Writer categories: Gen (M F), Age (<14 <24 <44 <64 <84 >85), Han (L R), Edu (H C), Ethn (H W B A O), Sch (U F)
There are a combinatorially large number of sub-groups.
[Diagram: lattice of constraint combinations over the six categories, with G = Gender, A = Age, H = Handedness, E = Ethnicity, D = eDucation, S = Schooling:
1 constraint: G, A, H, E, D, S
2 constraints: GA, GH, GE, GD, GS, AH, AE, AD, AS, HE, HD, HS, ED, ES, DS
3 constraints: GAH, GAE, GAD, GAS, GHE, GHD, GHS, GED, GES, GDS, AHE, ...
...
6 constraints: GAHEDS
For each candidate sub-group W: if |W| < support, reject.]
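Since the lattice grows combinatorially, enumeration with support-based pruning keeps it tractable. Below is a minimal sketch of that walk; the writer records, attribute names, and the MIN_SUPPORT threshold are illustrative assumptions, not values from the original study.

```python
from itertools import combinations

# Illustrative writer records; the attribute names follow the slide's six categories.
writers = [
    {"Gender": "M", "Age": "25~44", "Handedness": "R",
     "Ethnicity": "White", "eDucation": "C", "Schooling": "U"},
    # ... one dict per writer
]

ATTRS = ["Gender", "Age", "Handedness", "Ethnicity", "eDucation", "Schooling"]
MIN_SUPPORT = 30  # hypothetical threshold on sub-group size |W|

def enumerate_subgroups(writers, max_constraints=3):
    """Walk the constraint lattice level by level; reject W if |W| < support."""
    supported = []
    for k in range(1, max_constraints + 1):           # level 1, 2, 3, ...
        for attrs in combinations(ATTRS, k):          # e.g. (G,), (G, A), (G, A, H)
            groups = {}
            for w in writers:
                key = tuple(w[a] for a in attrs)      # e.g. ("M", "25~44")
                groups.setdefault(key, []).append(w)
            for key, members in groups.items():
                if len(members) >= MIN_SUPPORT:       # support check on |W|
                    supported.append((dict(zip(attrs, key)), len(members)))
    return supported
```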
Database
Writer data, raw feature data, and normalized feature data.
[Heat-map color scale: 0.0 ~ 1.0]
Feature Database (White and Black)
[Heat map of normalized features, organized by gender (female, male), ethnicity (white, black), and age group (12~24, 25~44, 45~64, >=65)]
What to do
1. Feature Selection
• The process of choosing an optimal subset of features according to a certain criterion (Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda)
• Since there are a limited number of writers in each sub-group, a reduced subset of features is needed
• To improve performance (speed of learning, predictive accuracy, or simplicity of rules)
• To visualize the data for model selection
• To reduce dimensionality and remove noise
Feature Selection
Example of feature selection:
[Scatter plots of strongly correlated feature pairs, e.g. features 1-2 vs. 2-3, 6-10 vs. 8-12, and 9-10 vs. 11-12]
• Knowing that some features are highly correlated with others can help remove redundant features
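That observation suggests a simple redundancy filter: compute the feature correlation matrix and keep only one feature from each highly correlated pair. The sketch below is illustrative; the 0.9 cutoff is an assumed parameter, not a value from the study.

```python
import numpy as np

def drop_correlated(X, names, cutoff=0.9):
    """Keep one feature from each highly correlated pair.

    X: (n_samples, n_features) array of normalized feature values.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature correlations
    keep = []
    for j in range(X.shape[1]):
        # Keep feature j only if it is not highly correlated with a kept feature.
        if all(corr[j, k] < cutoff for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]
```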
What to do
2. Visualization of trends (if any) across writer sub-groups
• A useful tool for quickly obtaining an overall structural view of sub-group trends
• Seeing is believing!
Implementation of Sub-group Analysis on NIJ Data
Task: which writer sub-group is more distinguishable than the others (if any)?
[Pipeline: Writer Data → find a sub-group that has enough support; Feature Data → Data Preparation → Sub-group Classifier → sub-group classification results]
Procedure for writer sub-group analysis:
• Find a sub-group that has enough support
• Choose 'the other' (complement) group
• Make four data sets for the Artificial Neural Network
• Train the ANN and get the results from the two test sets
Limitations:
• Three categories are used (gender, ethnicity, and age)
• Up to two constraints are considered
• Only document-level features are used
Sub-group Classifier
[Diagram: a handwritten document ("This is a test. This is a sample writing for document 1 written by an author a. ...") goes through feature extraction into a feature-space representation (dark, blob, hole, slant, height, ...), which is fed to an artificial neural network (11-6-1) that decides which group the writer belongs to.]
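A minimal sketch of such a classifier, using scikit-learn's MLPClassifier as a stand-in for the 11-6-1 network (11 inputs, one hidden layer of 6 logistic units, a single output). The random arrays are placeholders for the real feature data and sub-group labels.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: rows are documents, columns are 11 normalized document-level
# features; y is 1 for the target sub-group and 0 for its complement group.
X_train = np.random.rand(200, 11)
y_train = np.random.randint(0, 2, 200)
X_test = np.random.rand(50, 11)
y_test = np.random.randint(0, 2, 50)

# One hidden layer of 6 units; the single logistic output mirrors the 11-6-1 topology.
clf = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic", max_iter=2000)
clf.fit(X_train, y_train)

error_rate = 1.0 - clf.score(X_test, y_test)   # error rate on a held-out test set
print(f"Test error rate: {error_rate:.1%}")
```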
Results of Sub-group Classification

Sub-group     Error Rate (Test 1)  Error Rate (Test 2)  Average
Age1          25.6%                33.9%                29.8%
Age2          31.5%                30.2%                30.9%
Age3          44.9%                41.9%                43.4%
Age4          28.7%                32.4%                30.6%
Age5          19.1%                18.8%                19.0%
White         29.8%                32.3%                31.1%
Black         30.2%                31.7%                31.0%
Hispanic      25.2%                33.8%                29.5%
Female2       32.4%                33.3%                32.9%
Female3       30.0%                36.7%                33.4%
Female4       25.5%                20.3%                22.9%
Female5       15.0%                16.6%                15.8%
Female Black  29.7%                34.7%                32.2%
Female White  32.6%                34.8%                33.7%
Male2         43.6%                31.9%                37.8%
Male3         38.0%                40.0%                39.0%
Male White    32.7%                34.1%                33.4%
They’re distinguishable, but why?
• We need to explain why they’re distinguishable
• The ANN does a good job, but cannot clearly explain its output
• 12 features are too many to explain and visualize; only 2 (or 3) dimensions are visualizable
• Question: does a reasonable two- or three-dimensional representation of the data exist that can be analyzed visually?
Reference: Huan Liu and Hiroshi Motoda, Feature Selection for Knowledge Discovery and Data Mining.
Feature Extraction
• The common characteristic of feature extraction methods is that they all produce new features y based on the original features x
• After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision-tree building, can be conveniently applied
• Feature extraction started as early as the 1960s and 70s as the problem of finding the intrinsic dimensionality of a data set: the minimum number of independent features required to generate the instances
Visualization Perspective
• Data of high dimensionality cannot be analyzed visually
• It is often necessary to reduce its dimensionality in order to visualize the data
• The most popular method of determining topological dimensionality is the Karhunen-Loeve (K-L) method (also called Principal Component Analysis), which is based on the eigenvalues of a covariance matrix R computed from the data
Visualization Perspective (Cont.)
• The M eigenvectors corresponding to the M largest eigenvalues of R define a linear transformation from the N-dimensional space to an M-dimensional space in which the features are uncorrelated
• This property of uncorrelated features follows from the theorem that if the eigenvalues of a matrix are distinct, then the associated eigenvectors are linearly independent
• For the purpose of visualization, one may take the M features corresponding to the M largest eigenvalues of R
Applied to the NIJ data
1. Normalize each feature's values into the range [0,1]
2. Obtain the correlation matrix for the 12 original features
3. Find the eigenvalues of the correlation matrix
4. Select the two largest eigenvalues
5. Output the eigenvectors associated with the chosen eigenvalues; this gives a 12 x 2 transformation matrix M
6. Transform the normalized data Dold into data Dnew of extracted features: Dnew = Dold M
The resulting data is 2-dimensional, with the original class label attached to each instance.
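The six steps translate directly into a few lines of numpy; this is a minimal sketch of the stated procedure, with variable names of my own choosing.

```python
import numpy as np

def project_2d(D_old):
    """Steps 1-6: normalize, eigendecompose the correlation matrix, project to 2-D.

    D_old: (n_instances, 12) array of raw document-level features.
    """
    # 1. Normalize each feature into [0, 1]
    mins, maxs = D_old.min(axis=0), D_old.max(axis=0)
    D_norm = (D_old - mins) / (maxs - mins)

    # 2-3. Correlation matrix of the 12 features and its eigendecomposition
    R = np.corrcoef(D_norm, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh: R is symmetric

    # 4-5. Eigenvectors of the two largest eigenvalues form M (12 x 2)
    top2 = np.argsort(eigvals)[::-1][:2]
    M = eigvecs[:, top2]

    # 6. Dnew = Dnorm M  (n_instances x 2), ready for a 2-D scatter plot
    return D_norm @ M
```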
Applied to the NIJ data
[2-D scatter plots of the transformed NIJ data]
[2-D scatter plot of sample Iris data (the original is 4-dimensional)]