Download Improving Patent Organization with an Automated Classification

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
2001 Systems Engineering Capstone Conference • University of Virginia
IMPROVING PATENT ORGANIZATION WITH AN AUTOMATED
CLASSIFICATION ASSESSMENT TOOL
Student team: Jason Eshler, John Messina, Z. Rina Paz, Jason Ricketts
Faculty Advisor: Peter A. Beling
Department of Systems Engineering
University of Virginia
E-mail [email protected]
Client Advisors: David E. Martin, David S. Winer, Chris H. Starr, and Jason O. Watson
MCAM, Inc.
Charlottesville, VA
E-mail [email protected]
KEYWORDS: Classification, Morphogenetic set,
Patent, United States Patent and Trademark Office
(USPTO).
ABSTRACT
This Capstone research sought to improve patent
classification. The United States Patent and Trademark
Office (USPTO) classifies and issues all United States
Patents. The overwhelming number of patent requests
reduces the efficiency and effectiveness of the current
patent classification system. Our project focused on
designing and implementing an automated tool to
analyze the effectiveness and accuracy of the current
patent classification system. The tool assesses the
accuracy of patent classification based on a statistical
analysis of the patent class frequency distribution for a
patent’s or patent groups’ Morphogenetic set. The
Morphogenetic set is an expanded group of reference
patents for a particular patent or patent group.
Reference patents reference previously issued patents,
known as prior art. We coded algorithms, developed
appropriate statistics, and utilized patent databases to
analyze Morphogenetic set class distributions. The
results obtained from the statistical tool include
frequency mean, standard deviation, root mean square
error, and the cumulative distribution for all classes
within the Morphogenetic set. Our analysis and tool
help users effectively assess current classification
accuracy and consistency and present a large amount of
information in a clear and manageable form.
INTRODUCTION TO CLASSIFICATION
A patent is a grant made by a government that
confers upon the creator of an invention the sole right to
make, use, and sell that invention for a set period of
time. The Patent Classification System consists of over
six million patents classified into over four hundred
classes and over two hundred thousand sub-classes.
Patent applications have soared to more than two
hundred thousand per year recently, significantly
increasing the size, complexity, and inconsistency of
the patent classification database. In particular, the
classification inconsistencies seem to have two primary
causes. The first cause is inventors influencing the
classification of patents through the application
process. The second cause is patent examiners
determining the actual issuance and classification of
patents through the examination process.
The USPTO issues patents only for ideas or designs
exhibiting the following traits: novel, non-obvious and
reducible to practice. Due to the large volume of
patents submitted, a patent classification expert devotes
only five minutes to classifying each patent. This
manual system of classification is susceptible to errors
and is heavily influenced by patent applicants’
suggested fields of search. Two possible loopholes
exist in the application process. First, although
inventors are required to include a list of patents that
are related to their invention, they may choose any set
of patents that they feel best describes their invention.
An inventor seeking a patent can intentionally mislead
examiners by failing to cite similar patents. Unless the
patent examiners are familiar with existing intellectual
property in many subject areas, it is possible for
identical ideas to be patented. On the other hand, the
time taken to process an application, which was nearly
thirteen months in 1999, also makes it possible for an
idea to be patented twice. For example, similar patents
could be processed simultaneously without the
USPTO’s knowledge.
97
Improving Patent Organization
The patent examination phase consists of patent
examiners reviewing applications and granting patent
rights to inventors. To ensure consistency of
classification among examiners, all examiners use the
same tools, such as the Patent Classification System,
the Classification Manual, and the Classification
Definitions Book to issue patents. However,
classification consistency is very difficult to obtain
because each of the 350 patent examiners on the
classification board differ in their interpretation of case
law and the awareness of classification policy changes.
Likewise, the time allotted to examiners to inspect
individual patent applications, in this phase, is often not
enough to complete a careful analysis and research of
the invention.
patents and represented by triangle 2. A second
iteration of the process completes the Morphogenetic
set. It takes the two new patent groups, represented by
triangle 1 and 2, and finds their reference patents and
referenced by patents, circles 3 and 4 and circles 5 and
6 respectively, as demonstrated in Figure 1.
3
1
4
5
6
2
Represents the patent in question
In addition, the classification definitions used to by
the patent examiners to classify patents also provide
another venue by which examiners can misclassify
patents. Although many of the class definitions
contain notes to supplement definitions by explaining
terms or giving examples, they are inherently
subjective. One examiner actually placed a patent
application for a baby’s bottle with a top shaped like a
human breast in the same class as syringes, even though
all other bottles are classified under the plastic
container class.
1 Represents a patent referenced by
patent
3 Represents a patent referenced by
1
patent
4 Represents a patent that references
1
patent
2 Represents a patent that references
patent
5 Represents a patent referenced by
2
patent
6 Represents a patent that references
2
patent
DATA METHODS
Various methods exist for updating and improving
patent classification. The USPTO has tried to improve
the quality of patent classification methodology by
providing patent examiners with two computer-based
systems: the Examiner Automated Search Tool and the
Web-based Examiner Search Tool. However, both
systems are too large and slow to improve
classification. MCAM uses existing information to
supplement current classification methods. This
additional information helps judge the accuracy and
consistency of patent classification. We used
Morphogenetic sets to assess the current classification
system.
A Morphogenetic set is an extended group of
reference patents. A Morphogenetic set creates a
grouping of similar patents and is an improvement upon
finding similar patents using USPTO classes and subclasses. To create a Morphogenetic set, one chooses a
patent in question, represented by the black square in
Figure 1. From this patent, one selects its reference
patents, triangle 1. Next, one selects patents that
reference the patent of interest, termed referenced by
Figure 1: The Morphogenetic Set – MCAM’s innovative tool is
illustrated with an explanation of the symbols. The patent in
question is in the middle of the symbol (the square), its prior art is
the triangle above it and its future art is the triangle below it.
Data Collection
Each Morphogenetic set contains a list of patents
and each patent has a class and sub-class. We
aggregated this list of classes and their frequencies for
each patent within the patent or patent group’s
Morphogenetic set. Coded algorithms extracted the
Morphogenetic set and summed the frequencies of
observations of different classes across all patents from
the list. Figure 2 shows a typical frequency distribution
for an individual patent. Each row represents a
different class, in descending order of class frequency.
Order
1
2
3
4
5
6
…
Frequency
218
62
48
46
31
16
…
Class
320
D12
043
142
D53
623
…
Figure 2: Frequency Distribution – A typical distribution that is
determined by using MCAM’s Morphogenetic set.
98
2001 Systems Engineering Capstone Conference • University of Virginia
ANALYSIS
root mean square error by 100 to increase the scale of
the score.
To test the overall group distribution we selected
four different groups and plotted the classes along the
x-axis and percent occurrence along the y-axis. An
example of this output is shown in Figure 3.
Figure 4: Comparing Distributions – The distributions for the group
the individual patent are graphed. The values are simply the
percentages of each matching class for the group and the patent.
IMPLEMENTATION
Figure 3: Group Distribution – The distribution for a group of
patents is graphed with the classes on the x-axis and their percent
of occurrence on the y-axis.
We hypothesized that the graphs would differ based
on the consistency of the group. For example, we
expected different curve shapes and different statistics
to result. First, we compared the percent contribution
to the overall group from the most frequent class.
Second, we calculated the mean and standard deviation
of class frequencies for the three and five most frequent
classes from each group. These smaller subsets
highlighted the most influential classes for each group
and removed much of the random noise generated by
subsequent classes
The second analysis we performed involved
comparing individual patents to a group in question.
Figure 4 illustrates an example of the output generated.
One line represents the group and the other represents
the individual patent. The x-axis contains all of the
classes that are shared between the group and the
patent, and the y-axis is of the percent of occurrence.
We quantified the data in the graph by scoring the
goodness of fit. The score begins with the root mean
square error of the two sets of data. One calculates the
root mean square error by summing the squared
differences of percent composition for each common
class, dividing by the number of common classes, and
finally taking the square root. We next multiply the
Two primary aspects of constructing the tool were
designing the back-end and the front-end. The backend of the system needed to interact with existing
patent databases to generate the Morphogenetic set. On
the other hand, the front-end needed to display all
information extracted from the patent database in a
user-friendly environment, such as a graphical user
interface (GUI).
To extract the necessary information from patent
databases, we used the programming language Perl.
Microsoft Visual Basic (VB) was used to code the
front-end. We selected VB because charts,
spreadsheets, and summary reports could be easily
programmed into our tool. Additionally, because of the
familiarity of the Windows-based environment, VB was
ideal since the visual display of the buttons, charts, and
spreadsheets emulates Microsoft’s Windows operating
system and other Microsoft products.
RESULTS
To assess patent classification using the
Morphogenetic sets, our analysis tool compared
different scenarios for patent classification. Case Study
I compared patent groups and Case Study II compared
individual patents with patent groups. The analysis tool
used objective metrics to determine the accuracy of
classification.
99
Improving Patent Organization
Case Study II Results
Group
Case Study I: Group Consistencies
Case Study I analyzed the consistency between
different patent groups. The results showed that the
mean and standard deviation increased with an increase
in the accuracy of the patent group definition. A
summary of means and standard deviations of the first
three and first five classes for all four groups examined
appears in Figure 5. The graph illustrates the
tendencies for means and standard deviations to move
together and increase as the group becomes more
specific.
Random Group
Poorly defined subclass
Facsimile
Better defined subclass
Pump
Well defined concept
VLIW
Case Study I Results
Top 3 Classes
Mean Std
Dev
4.50
0.13
Top 5 Classes
Mean Std Dev
4.19
0.44
11.25
3.99
8.79
4.44
18.19
10.18
13.65
9.54
27.65
21.06
17.49
20.43
Top 5 Classes
20.00%
18.00%
16.00%
Mean
14.00%
Random
4956133 vs. Random Group
Poorly defined sub-class
5682421 vs. Facsimile
Better defined sub-class
5380159 vs. Pump
Well defined concept
5832205 vs. VLIW
4901307 vs. CDMA
Score
9.92
6.65
5.70
2.01
0.82
Figure 6: Scores – The score decreases from the random group to the
well defined concept because the individual patent becomes more
consistent with the group.
CONCLUSIONS
Summary
The project created a tool to compare different
classes for accuracy and consistency of USPTO’s
patent classification system. We compared different
groups of patents and assessed their consistencies. This
analysis revealed stark contrasts in consistencies of
patent groups. The results from Case Study I showed
that the mean and standard deviation increased with an
increase in the accuracy of the patent group definition.
The results from Case Study II showed a decrease in
RMSE, between the distributions of an individual
patent and its respective patent group, with an increase
in the accuracy of the patent group definition.
12.00%
10.00%
Interpretation
8.00%
6.00%
4.00%
2.00%
0.00%
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
Standard Deviation
Figure 5: Means and Standard Deviations – The means and standard
deviations all increase from the random group to the well defined
concept. When plotted, a near perfect line results.
Case Study II: Individual Patent vs. Patent Group
Case Study II analyzed the patent classification
accuracy when compared to a patent group. The results
showed a decrease in score between the distribution of
an individual patent with its respective patent group,
with an increase in the accuracy of the patent group
definition. A randomly selected patent from a random
patent group produced a high score, while a randomly
selected patent from a well defined concept group
produced a low score. Figure 6 summarizes the results
for Case Study II.
100
The results from Case Study I showed a direct
relationship between the accuracy of patent group
definition and the mean and standard deviation of the
frequencies in the Top 3 and 5 classes of the
Morphogenetic set. In more accurately defined patent
groups the Morphogenetic set is concentrated in fewer
classes than less accurately defined groups. Therefore,
the mean and standard deviation for the Top 3 and 5
classes increased with an increase in the accuracy of the
patent group definition. In addition, the graph of the
Morphogenetic set’s class distribution illustrates this
result. An accurately defined patent group resulted in a
more convex graph of its Morphogenetic set class
distribution, as shown in Figure 7.
The results suggested that a patent taken from an
accurately defined patent group would resemble its
respective patent group in terms of its Morphogenetic
set, as illustrated in Figure 8. Figure 8 is a comparison
of the IBM keyword search for CDMA vs. Patent
2001 Systems Engineering Capstone Conference • University of Virginia
More accurate
Base example
Less accurate
Figure 7: Group Distribution – Example distributions for a group
of patents are graphed. The figure compares distributions of
varying accuracy.
4901307; this represents the case of a patent taken from
a well defined patent group. Figure 4 is a comparison
of Patent Class 358/305 vs. Patent 5682421; this
represents the case of a patent taken from a poorly
defined patent group. In each case, the darker line
shows the class frequency distribution for the patent
group while the lighter line shows the distribution for
the individual patent. The graphs in Figures 4 and 8
illustrate the lower scores resulting from the
comparisons of individual patents to more accurately
defined groups.
solidify our conclusions and suggest problematic subclasses. Secondly, the tool does not have the ability to
compare one group of patents to another group. This
comparison is extremely beneficial between a class and
a concept; this would assess how similar a class of
patents is to the class’ concept.
Additional analysis for patent group comparisons
will demonstrate how changes in mean and standard
deviation correlate to a change in accuracy of patent
classification. Further analysis of Case Study II will
reveal how changes in the RMSE score reflect changes
in the accuracy of a patent group’s definition and an
individual patent’s classification. Preliminary results
from the analysis tool provided promising insight into
the accuracy of patent classification. Conducting
additional research will validate this use of this analysis
tool to assess the accuracy and consistency of patent
classification.
REFERENCES
Chen, H. and S. Dumais. 2000. “Bringing Order to the
Web: Automatically Categorizing Search Results.”
Proceedings of CHI’00, Human Factors in Computing
Systems: 145-152.
Coy, P. “The Creative Economy.” Business Week 28
August 2000: 75-214.
Deerwester, S., S. Dumais and R. Harshman. “Indexing
by Latent Semantic Analysis.” Journal of the Society
for Information Science 41(6): 391-407.
Figure 8: Comparing Distributions – The distributions for the group
the individual patent are graphed for CDMA. The values are simply
the percentages of each matching class for the group and the patent.
Recommendations
The data illuminates errors and inconsistencies in
the USPTO classification system, but a drawback of
this analysis method is that it requires prior knowledge
of classification to interpret and draw conclusions from
the class distributions provided by the analysis tool.
Our Capstone group has several recommendations
for the future. Optimally, one could apply the tool to
every sub-class of the USPTO and every patent
classified by the USPTO. Such a technique would
Koch, T. “The role of classification schemes in Internet
resource description and discovery.” DESIRE –
Development of a European Service for Information on
Research and Education Deliverable 3.2.3 – 19
February 1999. Online. Internet. 30 September, 2000.
Available: http://www.ub.lu.se/desire/radar/reports/
D3.2.3/class_v10.rtf.
“MCAM Doors.” MCAM. Online. Internet. 4
March 2001. Available: http://www.m-cam.com
Molholt, P. “Qualities of Classification Schemes for
the Information Superhighway.” Cataloging &
Classification Quarterly Volume 21, Number 2, 1995.
pp 19-22.
Rivette, K. G. and D. Kline. “Discovering New Value
in Intellectual Property.” Harvard Business Review
January 2000 v78.
101
Improving Patent Organization
BIOGRAPHIES
Jason Eshler is a fourth-year Systems Engineering
major from Richmond, VA. His principal contributions
to the project included Perl, Visual Basic coding for the
analysis tool and unlocking the door to the FERG room.
Jason has accepted a position with Providian Financial.
John Messina is a fourth-year Systems Engineering
student from Glenview, IL. His principal contributions
to the project focused on statistics and data analysis as
well as unnecessary work creation for Mr. Ricketts and
pretending “to-know.” Mr. Messina will be working
for Salomon Smith Barney’s Investment Banking
Division in Chicago following graduation.
Rina Paz is a fourth-year Systems Engineering major
from Alexandria, VA. Her principal contribution to
this project was data testing and often pretending “notto-know.” Rina is moving to L.A. to expand her
horizons and may be attending medical school upon
graduation if the witness protection program does not
accept her.
Jason Ricketts is a fourth year Systems Engineering
major from Stamford, CT. Jason’s principal
contributions were Perl and Visual Basic coding, data
analysis and spending hours in the FERG room
pretending to work. He has accepted a position in New
York with PricewaterhouseCoopers’ Management
Consulting Services where he hopes to kiss more a$$
than he did during his tenure at UVA.
102