Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
2001 Systems Engineering Capstone Conference • University of Virginia IMPROVING PATENT ORGANIZATION WITH AN AUTOMATED CLASSIFICATION ASSESSMENT TOOL Student team: Jason Eshler, John Messina, Z. Rina Paz, Jason Ricketts Faculty Advisor: Peter A. Beling Department of Systems Engineering University of Virginia E-mail [email protected] Client Advisors: David E. Martin, David S. Winer, Chris H. Starr, and Jason O. Watson MCAM, Inc. Charlottesville, VA E-mail [email protected] KEYWORDS: Classification, Morphogenetic set, Patent, United States Patent and Trademark Office (USPTO). ABSTRACT This Capstone research sought to improve patent classification. The United States Patent and Trademark Office (USPTO) classifies and issues all United States Patents. The overwhelming number of patent requests reduces the efficiency and effectiveness of the current patent classification system. Our project focused on designing and implementing an automated tool to analyze the effectiveness and accuracy of the current patent classification system. The tool assesses the accuracy of patent classification based on a statistical analysis of the patent class frequency distribution for a patent’s or patent groups’ Morphogenetic set. The Morphogenetic set is an expanded group of reference patents for a particular patent or patent group. Reference patents reference previously issued patents, known as prior art. We coded algorithms, developed appropriate statistics, and utilized patent databases to analyze Morphogenetic set class distributions. The results obtained from the statistical tool include frequency mean, standard deviation, root mean square error, and the cumulative distribution for all classes within the Morphogenetic set. Our analysis and tool help users effectively assess current classification accuracy and consistency and present a large amount of information in a clear and manageable form. INTRODUCTION TO CLASSIFICATION A patent is a grant made by a government that confers upon the creator of an invention the sole right to make, use, and sell that invention for a set period of time. The Patent Classification System consists of over six million patents classified into over four hundred classes and over two hundred thousand sub-classes. Patent applications have soared to more than two hundred thousand per year recently, significantly increasing the size, complexity, and inconsistency of the patent classification database. In particular, the classification inconsistencies seem to have two primary causes. The first cause is inventors influencing the classification of patents through the application process. The second cause is patent examiners determining the actual issuance and classification of patents through the examination process. The USPTO issues patents only for ideas or designs exhibiting the following traits: novel, non-obvious and reducible to practice. Due to the large volume of patents submitted, a patent classification expert devotes only five minutes to classifying each patent. This manual system of classification is susceptible to errors and is heavily influenced by patent applicants’ suggested fields of search. Two possible loopholes exist in the application process. First, although inventors are required to include a list of patents that are related to their invention, they may choose any set of patents that they feel best describes their invention. An inventor seeking a patent can intentionally mislead examiners by failing to cite similar patents. Unless the patent examiners are familiar with existing intellectual property in many subject areas, it is possible for identical ideas to be patented. On the other hand, the time taken to process an application, which was nearly thirteen months in 1999, also makes it possible for an idea to be patented twice. For example, similar patents could be processed simultaneously without the USPTO’s knowledge. 97 Improving Patent Organization The patent examination phase consists of patent examiners reviewing applications and granting patent rights to inventors. To ensure consistency of classification among examiners, all examiners use the same tools, such as the Patent Classification System, the Classification Manual, and the Classification Definitions Book to issue patents. However, classification consistency is very difficult to obtain because each of the 350 patent examiners on the classification board differ in their interpretation of case law and the awareness of classification policy changes. Likewise, the time allotted to examiners to inspect individual patent applications, in this phase, is often not enough to complete a careful analysis and research of the invention. patents and represented by triangle 2. A second iteration of the process completes the Morphogenetic set. It takes the two new patent groups, represented by triangle 1 and 2, and finds their reference patents and referenced by patents, circles 3 and 4 and circles 5 and 6 respectively, as demonstrated in Figure 1. 3 1 4 5 6 2 Represents the patent in question In addition, the classification definitions used to by the patent examiners to classify patents also provide another venue by which examiners can misclassify patents. Although many of the class definitions contain notes to supplement definitions by explaining terms or giving examples, they are inherently subjective. One examiner actually placed a patent application for a baby’s bottle with a top shaped like a human breast in the same class as syringes, even though all other bottles are classified under the plastic container class. 1 Represents a patent referenced by patent 3 Represents a patent referenced by 1 patent 4 Represents a patent that references 1 patent 2 Represents a patent that references patent 5 Represents a patent referenced by 2 patent 6 Represents a patent that references 2 patent DATA METHODS Various methods exist for updating and improving patent classification. The USPTO has tried to improve the quality of patent classification methodology by providing patent examiners with two computer-based systems: the Examiner Automated Search Tool and the Web-based Examiner Search Tool. However, both systems are too large and slow to improve classification. MCAM uses existing information to supplement current classification methods. This additional information helps judge the accuracy and consistency of patent classification. We used Morphogenetic sets to assess the current classification system. A Morphogenetic set is an extended group of reference patents. A Morphogenetic set creates a grouping of similar patents and is an improvement upon finding similar patents using USPTO classes and subclasses. To create a Morphogenetic set, one chooses a patent in question, represented by the black square in Figure 1. From this patent, one selects its reference patents, triangle 1. Next, one selects patents that reference the patent of interest, termed referenced by Figure 1: The Morphogenetic Set – MCAM’s innovative tool is illustrated with an explanation of the symbols. The patent in question is in the middle of the symbol (the square), its prior art is the triangle above it and its future art is the triangle below it. Data Collection Each Morphogenetic set contains a list of patents and each patent has a class and sub-class. We aggregated this list of classes and their frequencies for each patent within the patent or patent group’s Morphogenetic set. Coded algorithms extracted the Morphogenetic set and summed the frequencies of observations of different classes across all patents from the list. Figure 2 shows a typical frequency distribution for an individual patent. Each row represents a different class, in descending order of class frequency. Order 1 2 3 4 5 6 … Frequency 218 62 48 46 31 16 … Class 320 D12 043 142 D53 623 … Figure 2: Frequency Distribution – A typical distribution that is determined by using MCAM’s Morphogenetic set. 98 2001 Systems Engineering Capstone Conference • University of Virginia ANALYSIS root mean square error by 100 to increase the scale of the score. To test the overall group distribution we selected four different groups and plotted the classes along the x-axis and percent occurrence along the y-axis. An example of this output is shown in Figure 3. Figure 4: Comparing Distributions – The distributions for the group the individual patent are graphed. The values are simply the percentages of each matching class for the group and the patent. IMPLEMENTATION Figure 3: Group Distribution – The distribution for a group of patents is graphed with the classes on the x-axis and their percent of occurrence on the y-axis. We hypothesized that the graphs would differ based on the consistency of the group. For example, we expected different curve shapes and different statistics to result. First, we compared the percent contribution to the overall group from the most frequent class. Second, we calculated the mean and standard deviation of class frequencies for the three and five most frequent classes from each group. These smaller subsets highlighted the most influential classes for each group and removed much of the random noise generated by subsequent classes The second analysis we performed involved comparing individual patents to a group in question. Figure 4 illustrates an example of the output generated. One line represents the group and the other represents the individual patent. The x-axis contains all of the classes that are shared between the group and the patent, and the y-axis is of the percent of occurrence. We quantified the data in the graph by scoring the goodness of fit. The score begins with the root mean square error of the two sets of data. One calculates the root mean square error by summing the squared differences of percent composition for each common class, dividing by the number of common classes, and finally taking the square root. We next multiply the Two primary aspects of constructing the tool were designing the back-end and the front-end. The backend of the system needed to interact with existing patent databases to generate the Morphogenetic set. On the other hand, the front-end needed to display all information extracted from the patent database in a user-friendly environment, such as a graphical user interface (GUI). To extract the necessary information from patent databases, we used the programming language Perl. Microsoft Visual Basic (VB) was used to code the front-end. We selected VB because charts, spreadsheets, and summary reports could be easily programmed into our tool. Additionally, because of the familiarity of the Windows-based environment, VB was ideal since the visual display of the buttons, charts, and spreadsheets emulates Microsoft’s Windows operating system and other Microsoft products. RESULTS To assess patent classification using the Morphogenetic sets, our analysis tool compared different scenarios for patent classification. Case Study I compared patent groups and Case Study II compared individual patents with patent groups. The analysis tool used objective metrics to determine the accuracy of classification. 99 Improving Patent Organization Case Study II Results Group Case Study I: Group Consistencies Case Study I analyzed the consistency between different patent groups. The results showed that the mean and standard deviation increased with an increase in the accuracy of the patent group definition. A summary of means and standard deviations of the first three and first five classes for all four groups examined appears in Figure 5. The graph illustrates the tendencies for means and standard deviations to move together and increase as the group becomes more specific. Random Group Poorly defined subclass Facsimile Better defined subclass Pump Well defined concept VLIW Case Study I Results Top 3 Classes Mean Std Dev 4.50 0.13 Top 5 Classes Mean Std Dev 4.19 0.44 11.25 3.99 8.79 4.44 18.19 10.18 13.65 9.54 27.65 21.06 17.49 20.43 Top 5 Classes 20.00% 18.00% 16.00% Mean 14.00% Random 4956133 vs. Random Group Poorly defined sub-class 5682421 vs. Facsimile Better defined sub-class 5380159 vs. Pump Well defined concept 5832205 vs. VLIW 4901307 vs. CDMA Score 9.92 6.65 5.70 2.01 0.82 Figure 6: Scores – The score decreases from the random group to the well defined concept because the individual patent becomes more consistent with the group. CONCLUSIONS Summary The project created a tool to compare different classes for accuracy and consistency of USPTO’s patent classification system. We compared different groups of patents and assessed their consistencies. This analysis revealed stark contrasts in consistencies of patent groups. The results from Case Study I showed that the mean and standard deviation increased with an increase in the accuracy of the patent group definition. The results from Case Study II showed a decrease in RMSE, between the distributions of an individual patent and its respective patent group, with an increase in the accuracy of the patent group definition. 12.00% 10.00% Interpretation 8.00% 6.00% 4.00% 2.00% 0.00% 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% Standard Deviation Figure 5: Means and Standard Deviations – The means and standard deviations all increase from the random group to the well defined concept. When plotted, a near perfect line results. Case Study II: Individual Patent vs. Patent Group Case Study II analyzed the patent classification accuracy when compared to a patent group. The results showed a decrease in score between the distribution of an individual patent with its respective patent group, with an increase in the accuracy of the patent group definition. A randomly selected patent from a random patent group produced a high score, while a randomly selected patent from a well defined concept group produced a low score. Figure 6 summarizes the results for Case Study II. 100 The results from Case Study I showed a direct relationship between the accuracy of patent group definition and the mean and standard deviation of the frequencies in the Top 3 and 5 classes of the Morphogenetic set. In more accurately defined patent groups the Morphogenetic set is concentrated in fewer classes than less accurately defined groups. Therefore, the mean and standard deviation for the Top 3 and 5 classes increased with an increase in the accuracy of the patent group definition. In addition, the graph of the Morphogenetic set’s class distribution illustrates this result. An accurately defined patent group resulted in a more convex graph of its Morphogenetic set class distribution, as shown in Figure 7. The results suggested that a patent taken from an accurately defined patent group would resemble its respective patent group in terms of its Morphogenetic set, as illustrated in Figure 8. Figure 8 is a comparison of the IBM keyword search for CDMA vs. Patent 2001 Systems Engineering Capstone Conference • University of Virginia More accurate Base example Less accurate Figure 7: Group Distribution – Example distributions for a group of patents are graphed. The figure compares distributions of varying accuracy. 4901307; this represents the case of a patent taken from a well defined patent group. Figure 4 is a comparison of Patent Class 358/305 vs. Patent 5682421; this represents the case of a patent taken from a poorly defined patent group. In each case, the darker line shows the class frequency distribution for the patent group while the lighter line shows the distribution for the individual patent. The graphs in Figures 4 and 8 illustrate the lower scores resulting from the comparisons of individual patents to more accurately defined groups. solidify our conclusions and suggest problematic subclasses. Secondly, the tool does not have the ability to compare one group of patents to another group. This comparison is extremely beneficial between a class and a concept; this would assess how similar a class of patents is to the class’ concept. Additional analysis for patent group comparisons will demonstrate how changes in mean and standard deviation correlate to a change in accuracy of patent classification. Further analysis of Case Study II will reveal how changes in the RMSE score reflect changes in the accuracy of a patent group’s definition and an individual patent’s classification. Preliminary results from the analysis tool provided promising insight into the accuracy of patent classification. Conducting additional research will validate this use of this analysis tool to assess the accuracy and consistency of patent classification. REFERENCES Chen, H. and S. Dumais. 2000. “Bringing Order to the Web: Automatically Categorizing Search Results.” Proceedings of CHI’00, Human Factors in Computing Systems: 145-152. Coy, P. “The Creative Economy.” Business Week 28 August 2000: 75-214. Deerwester, S., S. Dumais and R. Harshman. “Indexing by Latent Semantic Analysis.” Journal of the Society for Information Science 41(6): 391-407. Figure 8: Comparing Distributions – The distributions for the group the individual patent are graphed for CDMA. The values are simply the percentages of each matching class for the group and the patent. Recommendations The data illuminates errors and inconsistencies in the USPTO classification system, but a drawback of this analysis method is that it requires prior knowledge of classification to interpret and draw conclusions from the class distributions provided by the analysis tool. Our Capstone group has several recommendations for the future. Optimally, one could apply the tool to every sub-class of the USPTO and every patent classified by the USPTO. Such a technique would Koch, T. “The role of classification schemes in Internet resource description and discovery.” DESIRE – Development of a European Service for Information on Research and Education Deliverable 3.2.3 – 19 February 1999. Online. Internet. 30 September, 2000. Available: http://www.ub.lu.se/desire/radar/reports/ D3.2.3/class_v10.rtf. “MCAM Doors.” MCAM. Online. Internet. 4 March 2001. Available: http://www.m-cam.com Molholt, P. “Qualities of Classification Schemes for the Information Superhighway.” Cataloging & Classification Quarterly Volume 21, Number 2, 1995. pp 19-22. Rivette, K. G. and D. Kline. “Discovering New Value in Intellectual Property.” Harvard Business Review January 2000 v78. 101 Improving Patent Organization BIOGRAPHIES Jason Eshler is a fourth-year Systems Engineering major from Richmond, VA. His principal contributions to the project included Perl, Visual Basic coding for the analysis tool and unlocking the door to the FERG room. Jason has accepted a position with Providian Financial. John Messina is a fourth-year Systems Engineering student from Glenview, IL. His principal contributions to the project focused on statistics and data analysis as well as unnecessary work creation for Mr. Ricketts and pretending “to-know.” Mr. Messina will be working for Salomon Smith Barney’s Investment Banking Division in Chicago following graduation. Rina Paz is a fourth-year Systems Engineering major from Alexandria, VA. Her principal contribution to this project was data testing and often pretending “notto-know.” Rina is moving to L.A. to expand her horizons and may be attending medical school upon graduation if the witness protection program does not accept her. Jason Ricketts is a fourth year Systems Engineering major from Stamford, CT. Jason’s principal contributions were Perl and Visual Basic coding, data analysis and spending hours in the FERG room pretending to work. He has accepted a position in New York with PricewaterhouseCoopers’ Management Consulting Services where he hopes to kiss more a$$ than he did during his tenure at UVA. 102