Data Mining on NIJ Data
Sangjik Lee

Unstructured Data Mining
• Pipeline: unstructured data (text, image) → keyword extraction / feature extraction → structured database → data mining.
• Source data: handwritten CEDAR letter documents.

Document-Level Features
Measures of pen pressure, writing movement, stroke formation, slant, and word proportion:
1. Entropy
2. Gray-level threshold
3. Number of black pixels
4. Stroke width
5. Number of interior contours
6. Number of exterior contours
7. Number of vertical slope components
8. Number of horizontal slope components
9. Number of negative slope components
10. Number of positive slope components
11. Slant
12. Height

Character-Level Features
• Gradient features: the gradient direction at pixel (i,j) is θ(i,j) = tan⁻¹(Sy(i,j) / Sx(i,j)), quantized into twelve 30° ranges (1°~30°, 31°~60°, …, 331°~360°) over a 4×4 grid of cells. Feature IDs run from G01-00 (direction 1, cell (0,0)) through G12-33 (direction 12, cell (3,3)); a sketch of this computation follows below.
• Structural features: rules r1 ~ r12 evaluated over the same 4×4 grid; IDs S01-00 through S12-33.
• Concavity features: coarse pixel density (C-CP), horizontal run length (C-HR), vertical run length (C-VR), upward concavity (C-UC), downward concavity (C-DC), left concavity (C-LC), right concavity (C-RC), and hole concavity (C-HC), each over the 4×4 grid; IDs C-CP-00 through C-HC-33.

Example character-level feature vectors:
Gradient (192 bits): 000000000011000000001100001110000000111000000011000000 11000100000000110000000000000111001100011111000011110000000010 01010000010001110011111001111100000100000100000000000000000000 01000001001000
Structure (192 bits): 000000000000000000001100001110001000010000100000010000 000000000100101000000000011000010100110000110000000000000100100 011001100000000000000110010100000000000001100000000000000000000 000000010000
Concavity (128 bits): 11110110100111110110011000000110111101101001100100000 110000011100000000000000000000000000000000000000000111111100000 000000000000

Writer and Feature Data
• Writer data: Gender (M, F); Age (<14, <24, <44, <64, <84, >85); Handedness (L, R); Education (H, C); Ethnicity (H, W, B, A, O, U); Schooling (F, …).
• Feature data (normalized): dark, blob, hole, slant, width, skew, ht.
• Each row pairs a writer's one-hot attribute indicators with one handwriting sample's normalized feature values (three samples per writer), e.g. attributes 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 with features .95 .49 .70 .71 .50 .10 .30.
[Table: sample writer-attribute and feature records omitted.]

Instances of the Data (normalized)
• Document-level feature data (12 features): entropy, dark pixel, blob, hole, hslope, nslope, pslope, vslope, slant, width, ht, …
[Table: sample normalized document-level feature values omitted.]
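As an illustration of the gradient feature described above, here is a minimal Python sketch: Sobel operators give Sx and Sy, the direction θ = tan⁻¹(Sy/Sx) is quantized into twelve 30° bins, and one bit is set per direction per 4×4 cell. All names are illustrative, and the rule used to binarize each cell (a 1/12 energy share) is a placeholder assumption, not the rule from the original system.

```python
import numpy as np
from scipy import ndimage

def gradient_features(char_img):
    """char_img: 2-D grayscale character image as a NumPy array.
    Returns a 12*4*4 = 192-element binary vector (G01-00 .. G12-33)."""
    img = char_img.astype(float)
    sx = ndimage.sobel(img, axis=1)                 # horizontal gradient Sx
    sy = ndimage.sobel(img, axis=0)                 # vertical gradient Sy
    theta = np.degrees(np.arctan2(sy, sx)) % 360.0  # direction in [0, 360)
    dirs = (theta // 30).astype(int)                # twelve 30-degree ranges
    mag = np.hypot(sx, sy)                          # gradient magnitude
    h, w = img.shape
    feats = np.zeros((12, 4, 4), dtype=int)
    for gi in range(4):                             # grid rows (0..3)
        for gj in range(4):                         # grid cols (0..3)
            rows = slice(gi * h // 4, (gi + 1) * h // 4)
            cols = slice(gj * w // 4, (gj + 1) * w // 4)
            cell_dirs, cell_mag = dirs[rows, cols], mag[rows, cols]
            for d in range(12):
                # Placeholder rule: set the bit when direction d carries
                # more than a 1/12 share of the cell's gradient energy.
                share = cell_mag[cell_dirs == d].sum()
                feats[d, gi, gj] = int(share > cell_mag.sum() / 12)
    return feats.reshape(-1)
```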
Data Mining on Sub-groups
• Example sub-groups: white female, white male, black female, black male.
• Subgroup analysis is useful information to be mined:
• 1-constraint subgroups: {Male : Female}, {White : Black : Hispanic}, etc.
• 2-constraint subgroups: {Male-white : Female-white}, etc.
• 3-constraint subgroups: {Male-white-25~45 : Female-white-25~45}, etc.
• There is a combinatorially large number of subgroups, built from the six constraints G (Gender), A (Age), H (Handedness), E (Ethnicity), D (eDucation), S (Schooling):
1 constraint: G, A, H, E, D, S
2 constraints: GA, GH, GE, GD, GS, AH, AE, AD, AS, HE, HD, HS, ED, ES, DS
3 constraints: GAH, GAE, GAD, GAS, GHE, GHD, GHS, GED, GES, GDS, AHE, …
… up to GAHEDS
• For a candidate subgroup W: if |W| < support, reject W. (A sketch of this enumeration appears after the Subgroup Classifier slide.)

Database
• Writer data, raw feature data, and normalized feature data.
• Normalized feature values are displayed on a color scale from 0.0 to 1.0.
• Feature database (White and Black): female and male writers, white and black, in age bands 12~24, 25~44, 45~64, and >=65.

What to Do 1: Feature Selection
• Feature selection is a "process that chooses an optimal subset of features according to a certain criterion" (Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda).
• Since there is a limited number of writers in each sub-group, a reduced subset of features is needed.
• To improve performance (speed of learning, predictive accuracy, or simplicity of rules).
• To visualize the data for model selection.
• To reduce dimensionality and remove noise.

Example of Feature Selection
[Figure: scatter plots of feature pairs 7-11, 1-3, 9-11, 7-9, and of ranges 1-2 ~ 2-3, 6-10 ~ 8-12, 9-10 ~ 11-12.]
• Knowing that some features are highly correlated to some others can help remove redundant features (see the correlation sketch below).

What to Do 2: Visualization
• Visualization of the trend (if any) of writer sub-groups is a useful tool for quickly obtaining an overall structural view of that trend.
• Seeing is believing!

Implementation of Subgroup Analysis on NIJ Data
• Task: which writer subgroup is more distinguishable than others (if any)?
• Pipeline: writer data and feature data → data preparation → find a subgroup that has enough support → subgroup classifier → results of subgroup classification.

Procedure for Writer Subgroup Analysis
1. Find a subgroup that has enough support.
2. Choose 'the other' (complement) group.
3. Make four data sets for the artificial neural network.
4. Train the ANN and collect the results from two test sets.
Limits: three categories are used (gender, ethnicity, and age); up to two constraints are considered; only document-level features are used.

Subgroup Classifier
• Feature extraction maps a handwritten document (e.g. "This is a test. This is a sample writing for document 1 written by an author a.") into a feature-space representation: dark, blob, hole, slant, height, ….
• An artificial neural network (11-6-1) then decides which group the writer belongs to (see the classifier sketch below).
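To make the constraint lattice concrete, here is a minimal Python sketch of the enumeration with support filtering; the data layout (one dict of attribute letters per writer) and all names are assumptions for illustration.

```python
from itertools import combinations

# Attribute letters from the constraint lattice: G(ender), A(ge),
# H(andedness), E(thnicity), eD(ucation), S(chooling).
ATTRS = "GAHEDS"

def qualified_subgroups(writers, min_support, max_constraints=3):
    """writers: iterable of dicts mapping attribute letter -> value,
    e.g. {"G": "F", "A": "<44", "H": "R", "E": "W", "D": "C", "S": "F"}.
    Yields (constraints, members) for each subgroup with enough support."""
    for k in range(1, max_constraints + 1):          # 1-, 2-, 3-constraint levels
        for letters in combinations(ATTRS, k):       # G, A, ..., GA, GH, ..., GAH, ...
            groups = {}
            for wr in writers:                       # group writers by their values
                key = tuple(wr[a] for a in letters)
                groups.setdefault(key, []).append(wr)
            for key, members in groups.items():
                if len(members) >= min_support:      # reject if |W| < support
                    yield dict(zip(letters, key)), members
```

For example, qualified_subgroups(writers, min_support=30) would yield a subgroup such as Male-White only when at least 30 writers match both constraints.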
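The redundancy check behind the feature-selection slide can be sketched the same way; the 0.9 cutoff below is an assumed threshold, not a value from the original study.

```python
import numpy as np

def redundant_pairs(X, names, threshold=0.9):
    """X: rows are documents, columns are normalized features in [0, 1];
    names: feature names for the columns. Returns highly correlated pairs."""
    corr = np.corrcoef(X, rowvar=False)         # feature-by-feature correlation
    pairs = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:    # one of each pair is redundant
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs
```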
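And a minimal stand-in for the 11-6-1 subgroup classifier, using scikit-learn's MLPClassifier in place of whatever network implementation the original work used; the logistic activation and training settings are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_subgroup_net(X_train, y_train):
    """X_train: n x 11 matrix of normalized document-level features;
    y_train: 1 for documents by the subgroup's writers, 0 for the
    complement group. Returns the fitted 11-6-1 network."""
    net = MLPClassifier(hidden_layer_sizes=(6,),   # one hidden layer of 6 units
                        activation="logistic",     # sigmoid units
                        max_iter=2000,
                        random_state=0)
    return net.fit(X_train, y_train)

def error_rate(net, X_test, y_test):
    """Misclassification rate on a held-out test set (cf. Test 1 / Test 2)."""
    return float(np.mean(net.predict(X_test) != np.asarray(y_test)))
```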
Results of Subgroup Classification

Subgroup        Error Rate (Test 1)   Error Rate (Test 2)   Average
Age1            25.6%                 33.9%                 29.8%
Age2            31.5%                 30.2%                 30.9%
Age3            44.9%                 41.9%                 43.4%
Age4            28.7%                 32.4%                 30.6%
Age5            19.1%                 18.8%                 19.0%
White           29.8%                 32.3%                 31.1%
Black           30.2%                 31.7%                 31.0%
Hispanic        25.2%                 33.8%                 29.5%
Female2         32.4%                 33.3%                 32.9%
Female3         30.0%                 36.7%                 33.4%
Female4         25.5%                 20.3%                 22.9%
Female5         15.0%                 16.6%                 15.8%
Female Black    29.7%                 34.7%                 32.2%
Female White    32.6%                 34.8%                 33.7%
Male2           43.6%                 31.9%                 37.8%
Male3           38.0%                 40.0%                 39.0%
Male White      32.7%                 34.1%                 33.4%

They're Distinguishable, but Why?
• We need to explain why the subgroups are distinguishable.
• The ANN does a good job, but it cannot clearly explain its output.
• Twelve features are too many to explain and visualize; only 2 (or 3) dimensions are visualizable.
• Question: does a reasonable two- or three-dimensional representation of the data exist that may be analyzed visually?
• Reference: Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda.

Feature Extraction
• The common characteristic of feature extraction methods is that they all produce new features y based on the original features x.
• After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision-tree building, can be conveniently used.
• Feature extraction started as early as the 1960s and 70s as the problem of finding the intrinsic dimensionality of a data set: the minimum number of independent features required to generate the instances.

Visualization Perspective
• Data of high dimensionality cannot be analyzed visually; it is often necessary to reduce the dimensionality in order to visualize the data.
• The most popular method of determining topological dimensionality is the Karhunen-Loeve (K-L) method (also called Principal Component Analysis), which is based on the eigenvalues of a covariance matrix R computed from the data.
• The M eigenvectors corresponding to the M largest eigenvalues of R define a linear transformation from the N-dimensional space to an M-dimensional space in which the features are uncorrelated.
• This property of uncorrelated features follows from the theorem that if the eigenvalues of a matrix are distinct, then the associated eigenvectors are linearly independent.
• For the purpose of visualization, one may take the M features corresponding to the M largest eigenvalues of R.

Applied to the NIJ Data
1. Normalize each feature's values into the range [0, 1].
2. Obtain the correlation matrix for the 12 original features.
3. Find the eigenvalues of the correlation matrix.
4. Select the two largest eigenvalues.
5. Output the eigenvectors associated with the chosen eigenvalues, giving a 12 × 2 transformation matrix M.
6. Transform the normalized data Dold into data Dnew of extracted features: Dnew = Dold M.
The resulting data is 2-dimensional, with the original class label attached to each instance; a sketch of these steps follows below.
[Figure: 2-D projections of the NIJ data, and of the sample Iris data (the original is 4-dimensional).]
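The six steps above map directly onto a few lines of NumPy. This sketch assumes the features are already normalized to [0, 1] and, following the slides, projects onto the eigenvectors of the correlation matrix.

```python
import numpy as np

def kl_transform_2d(D_old):
    """D_old: n x 12 matrix of features already normalized to [0, 1].
    Returns the n x 2 projection Dnew = Dold M."""
    R = np.corrcoef(D_old, rowvar=False)     # step 2: 12 x 12 correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)     # step 3: eigenpairs (R is symmetric)
    order = np.argsort(eigvals)[::-1]        # step 4: largest eigenvalues first
    M = eigvecs[:, order[:2]]                # step 5: 12 x 2 transformation M
    return D_old @ M                         # step 6: Dnew = Dold M
```

Plotting the two columns of the result against each other, colored by subgroup label, gives the 2-D views the closing slides refer to.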