Exercises in Machine Learning as a Tool for Archivists and Records Managers: A Case Study of Applying Classification and Association Analyses
April 11, 2013
Weijia Xu, Ph.D., Manager, Data Mining & Statistics Group, Texas Advanced Computing Center, University of Texas at Austin

How can data mining help archival processing?
• What is data mining?
– A process to find patterns
– A process to identify relationships
– A process to help users synthesize multiple information dimensions
– A process to transform data into knowledge

How can data mining help archival processing?
• What is archival processing?
– A process to describe the data, e.g. count, read, and sort records
– A process to help archivists understand categories of records, e.g. record groups and record series
– A process to make decisions, e.g. improve records management and access

How can data mining help archival processing?
• Data mining can help archivists conduct archival processing.
• [Diagram] Data → Information → Knowledge: documents, metadata, images → statistical summaries, patterns, relationships → historical trends, prediction of future data.
• [Diagram] In archival terms (data appraisal context): records, collections, metadata → identify the collection's structure, discover major topics and outliers → finding aids, collection functions, decision making.

A Case Study: State Department Cables
• State cables are the formal record of communication between the State Department and the embassies and consulates around the world.
• Predefined structure and semantic components:
– TAGS (Traffic Analysis by Geography and Subject)
– Security classification status
– Subject, a short phrase describing the content

An Exemplar Cable
• [Image of a sample cable]

Use Case and Archival Analysis Needs
• Understand the content of the collection for purposes of appraisal and description.
• Learn more about the functions and context of the collection.
• Provide access according to the records' original security classification.
• Re-classify documents based on new categories.

Challenges
• Scale of the collection
– A collection of over 450,000 declassified State Department cables from the years 1973 to 1976.
• Obscure content
– Cables are usually short.
– Fully understanding the meaning of a message may require background (historical) knowledge.
• The access privilege of a document may change over time, requiring re-classification of documents or the future addition of a new class.

Computational Approaches and Goal
• Goal: to help archivists understand the collection in order to provide better access mechanisms.
• Our approaches:
– Investigated how classification methods could help infer the categorization of a given cable.
– Investigated possible associations between cables.

Testing Dataset and Preprocessing
• A set of 150,000 declassified State Department cables from the years 1973 and 1974.
• Each cable is stored as a single PDF file.
• Text data and metadata are parsed and stored in a MySQL database for further analysis.
• iText is used for parsing the PDF files.
• The MySQL database includes the ID, the message text, and 60 metadata entries.

Classification
• Classification assigns a class label to each document.
• How? A two-stage approach: first learn a model from labeled data, then use the model to infer the correct label for new data.
• Example training set (Tid, Attrib1, Attrib2, Attrib3, Class):
  1   Yes  Large   125K  No
  2   No   Medium  100K  No
  3   No   Small   70K   No
  4   Yes  Medium  120K  No
  5   No   Large   95K   Yes
  6   No   Medium  60K   No
  7   Yes  Large   220K  No
  8   No   Small   85K   Yes
  9   No   Medium  75K   No
  10  No   Small   90K   Yes
• Example test set (class unknown):
  11  No   Small   55K   ?
  12  Yes  Medium  80K   ?
  13  Yes  Large   110K  ?
  14  No   Small   95K   ?
  15  No   Large   67K   ?
• [Diagram] A learning algorithm induces a model from the training set; the model is then applied (deduction) to assign a class to each record in the test set.
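As an illustration of the two-stage approach above, the following is a minimal sketch of inducing a model from a labeled training set and then applying it to unlabeled records, using the libsvm Java library that the Implementation slide later names. The toy feature vectors, labels, and class mapping here are placeholders for illustration, not the cable data.

```java
import libsvm.*;

// Minimal sketch of the two-stage workflow: induce a model from labeled
// training vectors, then apply it to unlabeled vectors (toy data only).
public class TwoStageSketch {

    // Convert a dense feature vector into libsvm's sparse representation.
    static svm_node[] toNodes(double[] features) {
        svm_node[] nodes = new svm_node[features.length];
        for (int i = 0; i < features.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;      // libsvm feature indices are 1-based
            nodes[i].value = features[i];
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Stage 1: learn a model from a labeled training set.
        double[][] trainX = { {1.0, 0.0}, {0.9, 0.1}, {0.1, 0.9}, {0.0, 1.0} };
        double[]   trainY = { 0, 0, 1, 1 };   // class labels (e.g., 0 = unclassified, 1 = secret)

        svm_problem prob = new svm_problem();
        prob.l = trainX.length;
        prob.x = new svm_node[prob.l][];
        prob.y = trainY;
        for (int i = 0; i < prob.l; i++) prob.x[i] = toNodes(trainX[i]);

        svm_parameter param = new svm_parameter();
        param.svm_type    = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.LINEAR;   // linear separator, as in the slides
        param.C = 1.0;
        param.eps = 1e-3;
        param.cache_size = 100;

        svm_model model = svm.svm_train(prob, param);

        // Stage 2: apply the model to infer labels for new, unlabeled records.
        double predicted = svm.svm_predict(model, toNodes(new double[] {0.2, 0.8}));
        System.out.println("Predicted class: " + predicted);
    }
}
```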
Support Vector Machine Classification
• A distance-based classification method.
• The core idea is to find the best hyperplane to separate the data from two classes.
• The class of a new object can be determined based on its distance from the hyperplane.

Binary Classification with a Linear Separator
• Red and blue dots represent objects from the two classes in the training data.
• The line is a linear separator for the two classes.
• The objects closest to the hyperplane are the support vectors.

Binary Classification with a Linear Separator
• There can be multiple separators.
• Which one is the best?
– The one that maximizes the margin.

Non-linear SVM
• Map the original feature space to a higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

Computational Model
• Each cable is represented by a vector (data point).
– The values of each data point are determined from the content/metadata of the document.
– We considered three data models based on different values: message body; subject + TAGS; subject only.
• The category of each cable is modeled as the class label.
• Each cable has one class label.

Implementation and Experiments
• The SVM classifier is developed with the libsvm Java library.
• In each test, two subsets of cables are randomly selected from the 1973/1974 collection for training and evaluation.
• The classification result is measured by:
– Accuracy: the overall percentage of correctly classified cables in the evaluation set.
– True positive rate: the percentage of cables correctly classified for a particular class.

Classification Results with Different Data Models
• Classification accuracies when the training and evaluation data are from the same time period (one year).
• [Chart: accuracy for the three data models: message body, subjects and TAGS, subjects only.]

Word Distribution in the Message's Body
• Mean of total words per document: 180; standard deviation: 294.
• Mean of total sentences per document: 13; standard deviation: 22.
• Mean of total unique words per document: 106; standard deviation: 119.

Classification Results with Different Learning Strategies
• Classification accuracies when the training data is from 1973 and the evaluation data is from 1974.
• [Chart: accuracy of the subjects-and-TAGS and subjects-only models, comparing same-period and different-period training and evaluation.]

A "Sliding Window" Test
• Sliding window: a subset of cables within a 6-month period.
• Classification results when the training and evaluation data are 1 month apart.
• [Chart: overall accuracy and true positive rate for secret documents, December 1972 through May 1974.]

Findings and Hypothesis
• "Good" results:
– The content of the message is not a strong indicator of the security classification, and results vary with different feature selection methods.
– TAGS may have the strongest correlation with the security classification.
• "Bad" results:
– The correlation between security classification and content may change over time.

A Closer Look at TAGS
• TAGS (Traffic Analysis by Geography and Subject).
• A controlled coding system issued by the State Department to indicate places, organizations, and themes; TAGS are carefully selected by department officials to create, store, and distribute the cables.
• There are about 14k unique TAGS in the 1973 and 1974 collection.
• Select a subset of TAGS for association analysis.
• Analysis code is developed using Weka.
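Before the Weka step, the TAG distribution summarized on the next slide can be derived with a simple document-occurrence count. Below is a minimal plain-Java sketch, assuming each cable's TAGS have already been parsed from the metadata into a list of strings; the class name, variable names, example TAGS, and the top-100 cutoff are hypothetical illustrations, not the project code.

```java
import java.util.*;

// Sketch: count, for each TAG, how many cables it appears in,
// so the most frequent TAGS can be selected for association analysis.
public class TagDistributionSketch {

    public static void main(String[] args) {
        // Hypothetical input: one list of TAGS per cable (normally read from the database).
        List<List<String>> tagsPerCable = List.of(
                List.of("PFOR", "MARR"),
                List.of("MASS", "SA", "JO"),
                List.of("PFOR", "TU"));

        Map<String, Integer> docOccurrence = new HashMap<>();
        for (List<String> tags : tagsPerCable) {
            for (String tag : new HashSet<>(tags)) {      // count each TAG once per cable
                docOccurrence.merge(tag, 1, Integer::sum);
            }
        }

        // Rank TAGS by document occurrence and keep, say, the top 100 for the rule mining step.
        docOccurrence.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(100)
                .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```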
TAGS Distribution
• [Chart: document occurrence and total occurrence counts for the most frequent TAGS.]

Collection Coverage of Top TAGS
• [Chart]

Association Analysis
• Goal: identify dependency rules that can predict the occurrence of one item based on the co-occurrence of another item.
• Tasks:
– Identify frequent combinations of TAGS.
– Identify rules that associate a combination of TAGS with a class label.
– Compare the rules generated for the different years (1973 and 1974).

Association Rule Mining
• Association rule: an expression of the form X → Y, where X and Y are itemsets.
– Example: {Tag 1, Tag 2} → {Tag 3}
• Association rule mining: given a set of transactions T, find all rules X → Y whose support and confidence meet given thresholds.
• Evaluation metrics:
– Support (s): the fraction of transactions that contain both X and Y.
– Confidence (c): measures how often Y appears in transactions that contain X.

Exemplar Association Result
• The number of rules and the quality of the rules are determined by the support and confidence thresholds.
• An exemplar rule: AORG=t TECH=t IAEA=t 168 ==> DOCUMENT_CLASS=UNCLASSIFIED 167 conf:(0.99)
– That is, of the 168 cables carrying the TAGS AORG, TECH, and IAEA, 167 are unclassified, giving a confidence of 167/168 ≈ 0.99.

Results of Rules Generation
• We generated a list of rules for archivists to analyze.
• The proportion of rules generated for each class roughly follows the proportion of cables in each class.
– Because the ratio of unclassified to secret documents is about 10:1, most of the rules generated are for unclassified documents.
• To uncover more rules for secret documents, the support threshold has to be reduced.

Combination Matters
• An interesting case:
– MASS=t SA=t JO=t 80 ==> DOCUMENT_CLASS=SECRET 80 conf:(1)
– SA=t JO=t 117 ==> DOCUMENT_CLASS=SECRET 114 conf:(0.97)
– MASS=t JO=t 197 ==> DOCUMENT_CLASS=SECRET 190 conf:(0.96)
• Per-TAG counts:
  TAG    # Secret  # Unclassified  % Secret  % Unclassified
  MASS   1078      281             0.79      0.21
  JO     430       406             0.51      0.49
  SA     586       371             0.61      0.39
  YE     307       97              0.76      0.24
  TC     134       209             0.39      0.61
  IR     410       533             0.43      0.57
  MARR   1127      594             0.65      0.35
  PFOR   3655      5633            0.39      0.61

Comparing Rules Generated From Different Years
• We reduced the support and confidence thresholds to generate more possible rules for each year.
• We compared the rules generated from one year with those from the other to detect changes in:
– TAG combinations
– The inference about the security class they belong to
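The deck notes that the association analysis code is developed with Weka. The sketch below shows one way Weka's Apriori associator could be run on a single year's cables with lowered support and confidence thresholds to produce a per-year rule list for comparison. The ARFF file names and attribute layout (one nominal {f,t} attribute per selected TAG plus a DOCUMENT_CLASS attribute) are assumptions for illustration, not the project's actual files.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: mine association rules for one year's cables with Weka's Apriori,
// using lowered support/confidence thresholds to surface more rules.
public class RuleMiningSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export: one nominal {f,t} attribute per selected TAG,
        // plus a DOCUMENT_CLASS attribute (e.g., UNCLASSIFIED, SECRET).
        Instances cables1973 = new DataSource("cables_1973_tags.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.001);  // low support so rarer (e.g., secret) combinations appear
        apriori.setMinMetric(0.7);               // minimum confidence for a reported rule
        apriori.setNumRules(500);                // cap on how many rules to generate

        apriori.buildAssociations(cables1973);
        System.out.println(apriori);             // prints rules such as: MASS=t JO=t ==> DOCUMENT_CLASS=SECRET

        // Running the same configuration on a cables_1974_tags.arff export would yield the
        // 1974 rule set; the two lists can then be compared for changed TAG combinations and classes.
    }
}
```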
Exemplar Comparison Results
• Each entry lists the TAG combination, confidence, support, and inferred class:
  1973 TAGS                               1974 TAGS
  [OVIP]      0.71  2678  UNCLASSIFIED    [OVIP, EG]          0.80   242  SECRET
                                          [OVIP, SY]          0.77   215  SECRET
  [UR]        0.80  3611  UNCLASSIFIED    [UR, PARM]          0.93   494  SECRET
                                          [UR, NATO]          0.89   241  SECRET
  [GE, PARM]  0.95    94  SECRET          [GE]                0.88  3440  UNCLASSIFIED
  [TU]        0.73   847  UNCLASSIFIED    [CY, PFOR, TU]      0.71   734  SECRET
                                          [CY, GR, TU]        0.74   665  SECRET
                                          [GR, TU]            0.71   833  SECRET
                                          [NATO, TU]          0.77   235  SECRET
                                          [TU, PINT]          0.72   265  SECRET
                                          [CY, TU, PINT]      0.85   207  SECRET
                                          [CY, GR, PFOR, TU]  0.72   559  SECRET
  [CY, TU]    0.72   913  SECRET
  [UR]        0.79  6047  UNCLASSIFIED    [UR, PARM]          0.90   157  SECRET
  [PFOR, IR]  0.78  1959  UNCLASSIFIED    [IR]                0.75   125  SECRET

Conclusions
• We show how classification and association analyses may assist archival processing.
• Our results indicate that classification is an effective method to categorize records accurately.
• However, an important assumption behind this accuracy is the consistency between the training data and the test data.
• Association rule mining may be effective in identifying keywords or trend changes in the document collection.
• The computational process maps onto archival processing as a non-linear analysis workflow.

Acknowledgement
• TACC team
– Maria Esteva, data archivist
– Jeffery Tang, undergraduate student
– Karthik Padmanabhan, graduate student
• NARA collaborator
– Mark Conrad
• Funding support
– NARA
– NSF

Weijia Xu
[email protected]
512-232-7158
For more information: www.tacc.utexas.edu

A Case Study with the State Department Cables Collection
• Introduction
– Problem definition
– Data description and preprocessing
• Applying classification algorithms
– Algorithm introduction
– Same time period
– Different time period
– Timeline study
• Applying association analysis
– Methodology introduction
– Association rule mining
– Rule comparisons
• Conclusions