Food Label Data Collection Using OCR
Royce Nobles, UNCW CSC592 Pattern Recognition, Spring 2004
1 INTRODUCTION
1.1 Motivation for Food Label Data Collection Using OCR
The difficulty involved in nutrient data collection is a major issue facing the
nutrient analysis community. Nutrient data collection is generally undertaken by
researchers from the United States Department of Agriculture’s Nutrient Data
Laboratory and major nutrient analysis software companies. The Nutrient Data
Laboratory periodically releases its findings in the form of a database known as the
National Nutrient Database for Standard Reference (NNDSR), which is freely
available to the general public. The heart of the problem lies in the frequency with
which the NNDSR is released and in the completeness and resolution of the
database. The most current release as of April 26, 2004, NNDSR16-1, contains less
than 7,000 foods and is not likely to be updated within the next year. An added
vexation is the fact that values for specific foods are often averaged and presented
as categories which can be quite general.
This poses a major problem for nutrient analysis software companies, as end
users demand that nutrient analysis software products contain a broad range of
specific food items. The result is that these companies must collect nutrient data for
foods not specifically covered in the most current NNDSR version. The process of
collecting and categorizing foods and their corresponding nutrient data currently
involves individuals who manually enter nutrient data from Food Labels into a food
database. There are major drawbacks to this method of data collection, including
high labor costs, the introduction of human data entry errors, and the relatively
slow speed at which data can be collected manually. The focus of this project is to
automate this process, thereby reducing the negative impact of these issues.
1.2 Specific Aim and Implementation
The specific aim of this project is to develop an Optical Character Recognition
system to automate the process of collecting nutrient data from standard Nutrition
Facts Food Label images. The optical character recognition system is divided into
five easily discernible components, each of which is discussed in detail in the
sections that follow. A brief description of each component is listed below,
followed by a high-level sketch of the pipeline.
1. Optical Sensing – A computer-attached optical scanner is used to collect an
image from a Nutrition Facts Food Label. The image is then filtered and
prepared for segmentation and grouping.
2. Segmentation and Grouping – The image is first segmented into horizontal
rows of pixel data. Each row is then further divided into specific words, which
correspond to classes.
3. Feature Extraction – Specific features are extracted from each word and
prepared for classification.
4. Classification – Individual words are compared with template data using
documented pattern recognition techniques and classified.
5. Post Processing – According to the recommendation of the classifier, nutrients
and their values are combined and stored in a food database.
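The structure described above can be summarized as a short pipeline skeleton. The following is a minimal sketch only; the function names are hypothetical placeholders and do not come from the original system.

```python
# A high-level sketch of the five-stage pipeline. Each stub stands in for
# one of the components described above; all names are hypothetical.
def scan_and_filter(path):             return []    # 1. optical sensing
def segment_and_group(image):          return []    # 2. segmentation and grouping
def extract_features(word):            return []    # 3. feature extraction
def classify(features):                return None  # 4. classification
def store_nutrients(labels, database): pass         # 5. post processing

def process_label(path, database):
    image = scan_and_filter(path)
    words = segment_and_group(image)
    labels = [classify(extract_features(w)) for w in words]
    store_nutrients(labels, database)
```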
2 OPTICAL SENSING
2.1 Collection of Sample Data
An initial set of eight Nutrition Facts Food Labels ranging in height from 399 to
794 pixels was selected for use as sample data. These labels were chosen primarily
because of similarities in their font sizes and weights, and because they contain a
good sampling of the set of nutrients displayed on Food Labels. Each label consists
of a light colored background, dark blue or black text, and a dark border. Food
Labels are stored as 256-color GIF images, allowing them to be opened and
manipulated easily.
2.2 Sensing Procedures
The Food Label image is opened and stored as an object containing an image and
the label height and width in pixels. The Food Label object is then prepared for
segmentation by an image filter class consisting of a border filter, a Prewitt edge
detector and a median filter. Sobel and Laplace edge detectors were also tested in
place of the Prewitt edge detector but yielded significantly less useful results.
The border filter is a simple set of procedures that remove any existing borders
from the image to simplify the segmentation process. It begins by testing the top
edge of the image for horizontal regions of high pixel intensity. Once a region of
high pixel intensity is located near the top of the image, it is marked as the lower
boundary of the border and all pixels above that region are converted to white. The
same basic procedure is used to locate and effectively erase the bottom, left, and
right borders in turn.
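A minimal sketch of the top-border pass is shown below, assuming a grayscale NumPy array in which white pixels have high intensity; the function name and threshold value are illustrative assumptions rather than details taken from the original system.

```python
import numpy as np

def remove_top_border(image, white_thresh=240.0):
    """Erase the dark top border, assuming white ~ 255 (high intensity)."""
    out = image.astype(float).copy()
    for y in range(out.shape[0]):
        # The first near-white row from the top marks the lower boundary
        # of the border; all pixels above it are converted to white.
        if out[y].mean() >= white_thresh:
            out[:y] = 255.0
            break
    return out
```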
Figure 1 – Shows the set of two 3x3 convolution masks used
by the Prewitt edge detector.
Once any borders have been removed from the image, a Prewitt edge detector is
used to reduce noise and improve the system's ability to locate related regions of low
pixel intensity. The Prewitt edge detector is designed to respond to edges of
contrast running vertically and horizontally relative to the pixel grid, with one mask
for each orientation. The primary benefit of choosing this algorithm is that it
partially fills the space between individual characters in words while reducing
pixelation and maintaining word spacing.
A 3x3 median filter is used to remove any residual noise left behind by the
Prewitt edge detector. The median filter is used because it is extremely effective at
removing speckling, a common side effect of optical scanning, without the loss of
image quality generated by other commonly used filters such as the low-pass and
mean filters. The 3x3 mask was chosen over the more common 5x5 mask to preserve
the delicate edges associated with alphanumeric characters.
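A minimal sketch of the edge-detection and median-filtering steps is given below, using the two standard 3x3 Prewitt masks of Figure 1 and assuming NumPy and SciPy; the exact scaling and polarity of the original system's output may differ.

```python
import numpy as np
from scipy import ndimage

# The two standard 3x3 Prewitt convolution masks, one per orientation
# (see Figure 1).
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]])
PREWITT_Y = PREWITT_X.T

def filter_label(image):
    img = image.astype(float)
    gx = ndimage.convolve(img, PREWITT_X)   # response to vertical edges
    gy = ndimage.convolve(img, PREWITT_Y)   # response to horizontal edges
    edges = np.hypot(gx, gy)                # combined edge magnitude
    # A 3x3 median filter removes residual speckle while preserving the
    # delicate edges of alphanumeric characters.
    return ndimage.median_filter(edges, size=3)
```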
The result of filtering a Nutrition Facts Food Label image with the techniques
described above is illustrated in Figure 2. The contrast between the foreground text
and background is greatly enhanced, and nearly all noise generated by scanning has
been removed. Also notice that the white space between characters has been
greatly reduced, thereby decreasing the likelihood that words will be incorrectly
segmented internally.
Figure 2 - Shows a section of Sample Image 1 before filtering (left), and after filtering (right).
3 SEGMENTATION AND GROUPING
The image is first divided into rows of data by locating horizontal areas of high
pixel intensity separating horizontal areas of low pixel intensity. The low pixel
intensity areas are considered to be rows of potentially valuable text and are
collected for further segmentation. Each row is stored as a list containing a one
dimensional array of pixel intensity values and the height and width of the row.
Rows collected measuring less than eight pixels in height are discarded due to the
observation that actual rows of text range from fifteen to seventeen pixels in height.
These narrow rows are commonly residual noise left behind after filtration or the
horizontal dividing lines between rows containing nutrient data.
Each row is then divided into words based on vertical columns of high pixel
intensity separating areas of lower pixel intensity. These words are treated as
specific classes by the system and are stored as word objects with properties
including a pixel intensity array, width, height, row number and the order of
occurrence within the row.
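A minimal sketch of this projection-profile segmentation is given below, assuming a filtered grayscale NumPy array in which text is dark (low intensity) and the separating bands are light; the threshold value and the word-record layout are assumptions.

```python
import numpy as np

def split_on_light_bands(profile, thresh):
    """Return (start, end) pairs of low-intensity runs separated by light bands."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value < thresh and start is None:
            start = i                        # entering a dark (text) run
        elif value >= thresh and start is not None:
            spans.append((start, i))         # a light band ends the run
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def segment(image, thresh=200.0):
    rows = split_on_light_bands(image.mean(axis=1), thresh)
    rows = [(t, b) for t, b in rows if b - t >= 8]    # discard rows under 8 px
    words = []
    for number, (top, bottom) in enumerate(rows):
        band = image[top:bottom]
        for order, (left, right) in enumerate(
                split_on_light_bands(band.mean(axis=0), thresh)):
            words.append({"row": number, "order": order,
                          "pixels": band[:, left:right]})
    return words
```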
4 FEATURE EXTRACTION
Features used for classification are extracted from the pixel data of each word.
In this system, the power spectrum consisting of eight normalized measurements is
collected as the primary feature. This is generated by collecting a 64x16 pixel
sample from the upper left corner of the word. Words less than 64 pixels in width
are padded with white space to fill 64 pixels, while wider words are truncated. As
demonstrated in Figure 3, the percentage of window filled by each word is quite
variable.
Figure 3 - Shows the 64x16 pixel data sample window with collected words.
A two dimensional Fourier Transform is then applied to a complex number
representation of the 64x16 pixel data window. The powers are collected by
summing the squares of the real and imaginary components of the transform data and
stored in a 64x16 array as shown in Figure 4. The positive harmonics are collected
from the upper left 32x8 section of the array and divided into eight 8x4 feature bins.
The values in each bin are summed and normalized by the maximum component to
yield the normalized power spectrum with values ranging from zero to one.
Figure 4 - Shows the eight 8x4 power spectrum bins collected from the spatial frequency
information.
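A minimal sketch of this feature computation is given below, assuming NumPy and a grayscale window in which white is 255; the exact bin layout used by the original system may differ slightly.

```python
import numpy as np

def power_spectrum_bins(word):
    """Eight normalized power-spectrum measurements from a 64x16 window."""
    window = np.full((16, 64), 255.0)           # pad with white space
    h, w = min(word.shape[0], 16), min(word.shape[1], 64)
    window[:h, :w] = word[:h, :w]               # truncate words that overflow
    power = np.abs(np.fft.fft2(window)) ** 2    # real^2 + imaginary^2
    positive = power[:8, :32]                   # upper-left 32x8 positive harmonics
    # Divide the 8x32 region into eight 8x4 bins (4 rows by 8 columns each),
    # sum each bin, and normalize by the maximum component.
    bins = positive.reshape(2, 4, 4, 8).sum(axis=(1, 3)).ravel()
    return bins / bins.max()
```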
5 CLASSIFICATION
5.1 The Standard Template
Several attempts at creating a template for use with the different classification
techniques failed early on. The most notable problem occurred with the first
template which consisted of slightly modified versions of each nutrient found on
Sample Label 1. The lack of variance demonstrated by the template resulted in the
failure of the Bayesian classifier and greatly degraded the performance of the Feed
Forward Back Propagation Neural Network when attempting to classify Sample
Labels 2 through 7.
These issues gave rise to the creation of the standard template which consists of
every nutrient name found on each of five Food Labels chosen from the original set
of eight. The five labels used to create the standard template were chosen because
they exhibited font size and color patterns that appeared to be representative of the
complete set of eight labels. Once created, the standard template was processed
and filtered by the same system used to collect features from the sample labels to
ensure the template data would accurately reflect the sample data.
A value key consisting of a five-digit binary number for every word found on the
standard template was created as a means of assigning values for the classification
of collected words. The key values for each word range from zero to seventeen and
are assigned in alphabetical order. Figure 5 shows the complete binary key set as
well as a small portion of the standard template.
Figure 5 - Shows the binary key for the class values (left) and a sample of the
standard template (right).
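The key scheme can be illustrated with a short snippet; the word list below is purely hypothetical and is not the original key set.

```python
# A hypothetical illustration of the five-digit binary key: each distinct
# word on the standard template is assigned a value in alphabetical order.
words = sorted(["Calories", "Calcium", "Cholesterol", "Iron", "Protein", "Sodium"])
key = {word: format(value, "05b") for value, word in enumerate(words)}
print(key)   # {'Calcium': '00000', 'Calories': '00001', 'Cholesterol': '00010', ...}
```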
5.2 The Bayesian Classifier
The a priori data for the Bayesian Classifier is collected from the eight power
spectrum bin values associated with each word from the standard template. The bin
values constitute class exemplars which are used as input vectors to calculate class
means. The class mean is then subtracted from each input vector for the class, and
the difference is multiplied by its transpose to create a matrix. These matrices are
then added together and multiplied by the reciprocal of the number of exemplars to
yield a covariance matrix. The determinant and the inverse of the covariance matrix are then
calculated. From there the Bayesian formula is used to create a discriminant
function for the class. The discriminant function for each class is applied to the bin
value input vector of the word object being classified. The class whose discriminant
function returns the largest value is then chosen as the match.
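A minimal sketch of this procedure is given below, assuming equal class priors, NumPy arrays of exemplars, and non-singular covariance matrices; the names are assumptions rather than the original code.

```python
import numpy as np

def train_bayes(class_exemplars):
    """class_exemplars maps a class value to an (n_exemplars, 8) array."""
    params = {}
    for label, X in class_exemplars.items():
        mu = X.mean(axis=0)
        diff = X - mu                         # subtract the class mean
        cov = diff.T @ diff / len(X)          # (1/n) * sum of outer products
        sign, logdet = np.linalg.slogdet(cov)
        params[label] = (mu, np.linalg.inv(cov), logdet)
    return params

def classify_bayes(x, params):
    # Discriminant g(x) = -1/2 (x-mu)^T inv(Sigma) (x-mu) - 1/2 ln|Sigma|,
    # assuming equal priors; the class with the largest value is the match.
    def g(mu, inv_cov, logdet):
        d = x - mu
        return -0.5 * d @ inv_cov @ d - 0.5 * logdet
    return max(params, key=lambda label: g(*params[label]))
```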
The first attempted implementation of the Bayesian classifier failed due to a lack
of variability in the original template as mentioned in Section 5.1. The Bayesian
classifier now executes correctly with the standard template, but statistics have not
yet been compiled to indicate the accuracy of the predictions.
5.3 The Feed Forward Back Propagation Neural Network
The Feed Forward Back Propagation Neural Network is composed of eight input
neurons corresponding to the eight power spectrum bin values, one hidden layer
with fifty neurons, and five output neurons corresponding to the five binary digits of
the class key values. A sigmoid response function is used with a learning rate of one
to calculate connection weights between neurons, and each neuron is assigned a bias
of one. The network was trained on 200 patterns per session randomly chosen from
the template until a Total Sum Square Error of 0.001 was reached. The network
typically trained successfully in 30,000 to 40,000 sessions.
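One training step for such a network might look like the following minimal sketch, assuming squared-error back propagation in NumPy; the weight initialization scheme is an assumption, as it is not described above.

```python
import numpy as np

# An 8-50-5 network: eight power-spectrum inputs, fifty hidden neurons,
# five output digits. Each neuron's bias starts at one and the learning
# rate is one, as described above; random weights are an assumption.
rng = np.random.default_rng(0)
W1, b1 = rng.uniform(-0.5, 0.5, (50, 8)), np.ones(50)
W2, b2 = rng.uniform(-0.5, 0.5, (5, 50)), np.ones(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, lr=1.0):
    global W1, b1, W2, b2
    h = sigmoid(W1 @ x + b1)                   # hidden layer response
    y = sigmoid(W2 @ h + b2)                   # five binary output digits
    delta2 = (y - target) * y * (1.0 - y)      # output-layer error term
    delta1 = (W2.T @ delta2) * h * (1.0 - h)   # back-propagated hidden error
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1
    return np.sum((y - target) ** 2)           # contribution to the TSSE
```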
Results with the Feed Forward Back Propagation Neural Network are encouraging:
it correctly classifies 93.40% of words when false positives generated by the
collection of non-features during feature extraction are omitted. When these results
are averaged with the results of actual features classified, the percentage of correct
classification is somewhat lower.
6 POST PROCESSING
Post processing has not been implemented in the system at this point. The
major goals of post processing will be to link the features collected to form nutrients
with corresponding values and to check for obvious errors generated during
classification. The row and order of classified words will be used to link them
together forming complete nutrient names. Words that are combined to form nonexistent nutrient names can be flagged as classification errors.
Figure 6 – Shows an example of constructing full nutrient names from words.
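Since this stage has not been implemented, the following is only a possible sketch of the planned linking step, assuming word records that carry row, order, and classified name fields as produced earlier in the pipeline.

```python
def link_nutrients(classified_words, known_nutrients):
    """Join classified words by row and order into full nutrient names."""
    rows = {}
    for word in sorted(classified_words, key=lambda w: (w["row"], w["order"])):
        rows.setdefault(word["row"], []).append(word["name"])
    names = [" ".join(parts) for parts in rows.values()]
    # Names that do not match any known nutrient are flagged as
    # classification errors, as described above.
    return [(name, name in known_nutrients) for name in names]
```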
7 FUTURE WORK
Much work remains to be done in several areas of the project. The key topics for
exploration include better methods for determining which features are relevant to
extract from the labels, a separate classification system for identifying nutrient
values, scaling methods to allow a greater size range of labels to be classified,
experimentation with other classifiers such as ART-2 and Kohonen networks, and
post processing methods for error checking and data storage. Experimental results thus
far are encouraging, and further investigation of the issues mentioned above will
most likely yield improved results.