Exercises in Machine Learning as a Tool for
Archivists and Records Managers:
A Case Study of Applying Classification and Association
Analyses
April 11, 2013
Weijia Xu, Ph.D.
Manager, Data Mining & Statistics Group
Texas Advanced Computing Center
University of Texas at Austin
How can data mining help archival
processing?
• What is data mining?
– Process to find patterns
– Process to identify relationships
– Process to help users synthesize multiple
information dimensions
– Process to transform data into knowledge
How can data mining help archival
processing?
• What is archival processing?
– Process to describe the data, e.g. count, read, sort
records
– Process to help archivist understand categories of
records, e.g. records groups and records series
– Process to make decisions, e.g. improve records
management and access.
How can data mining help archival
processing?
• Data mining can help archivists conduct
archival processing.
Data
Information
knowledge
Documents
Metadata
Images
…
Statistical
summaries;
Patterns;
Relationships;
…
Historical
trends;
Prediction of
future data;
…
Data
Appraisal
Context
Records
Collections
Metadata
…
Identify the
collection’s
structure;
Discover major
topics and
outliers.
Finding aids;
Collection
functions,
Decisionmaking …
A Case Study: State Department Cables
• State Cables are the formal record of
communication between the State department
and the embassies and consulates around the
world.
• Predefined structure and semantic components
– TAGS (Traffic Analysis by Geography and Subject)
– Security classification status
– Subject, a short phrase describing the content
An Exemplar Cable
Use Case and Archival Analysis Needs
• Understand the content of the collection for
purposes of appraisal and description.
• Learn more about the functions and context
of the collection.
• Provide access according to the records' original
security classification.
• Re-classify documents based on new
categories.
Challenges
• Scale of the collection
– A collection of over 450,000 declassified State
Department cables from the years 1973 to 1976.
• Obscure content
– Cables are usually short
– Fully understanding the meaning of a message
may require background (historical) knowledge
• The access privileges of a document may change
over time, requiring re-classification of
documents or the future addition of new classes
Computational Approaches and Goal
• Goal: To help archivists understand the
collection in order to provide better access
mechanisms.
• Our approaches:
– Investigated how classification methods could
help to infer the categorization of a given cable.
– Investigated possible associations between cables.
Testing Dataset and Preprocessing
• Set of 150,000 declassified State Department cables
from the years 1973 and 1974
• Each cable is stored as a single PDF file
• Text data and metadata are parsed and stored in a
MySQL database for further analysis
• iText is used for parsing the PDF files.
• MySQL database includes ID, message text, and 60
metadata entries
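The storage step above can be sketched as follows. This is a minimal illustration, not the study's code: the study used iText and MySQL, while here Python's built-in sqlite3 stands in for MySQL, and the columns are an invented subset of the ~60 metadata entries.

```python
# Sketch of the storage step.  sqlite3 (Python stdlib) stands in for MySQL;
# the columns are an invented subset of the ~60 metadata entries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cables (
                    id INTEGER PRIMARY KEY,
                    subject TEXT,
                    tags TEXT,
                    doc_class TEXT,
                    body TEXT)""")

# One hypothetical parsed cable.
conn.execute("INSERT INTO cables VALUES (?, ?, ?, ?, ?)",
             (1, "Grain export license", "ETRD AORG", "UNCLASSIFIED",
              "Full message text would go here."))

row = conn.execute("SELECT subject, doc_class FROM cables").fetchone()
print(row)
```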
Classification
• Classification assigns
a class label to each
document
• How? A two-stage
approach: first learn
a model from labeled
data, then use the
model to infer the
correct label
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training Set → Learning algorithm (Induction) → Learn Model

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Test Set → Apply Model (Deduction) → Predicted class
Support Vector Machine Classification
• A distance-based classification method.
• The core idea is to find the best hyperplane to
separate data from two classes.
• The class of a new object can be determined
based on its distance from the hyperplane.
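The decision rule described above can be sketched in a few lines. This is an illustration only, not the study's implementation (which used the libsvm Java library): the weight vector `w`, offset `b`, and class names are hypothetical, and a real model would learn them from training data.

```python
# Minimal sketch of the linear SVM decision rule.  The hyperplane
# parameters w and b below are made up for illustration; a real SVM
# learns them from the training set.

def svm_predict(x, w, b):
    """Classify x by which side of the hyperplane w.x + b = 0 it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "secret" if score > 0 else "unclassified"

w = [0.8, -0.4]   # hypothetical hyperplane normal
b = -0.1          # hypothetical offset
print(svm_predict([1.0, 0.5], w, b))   # score 0.5 > 0 → "secret"
```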
Binary Classification with Linear Separator
• Red and blue dots are
representations of
objects from two
classes in the training
data
• The line is a linear
separator for the two
classes
• The closest objects to
the hyperplane are the
support vectors.
Binary Classification with Linear Separator
• There could be
multiple separators
• Which one is the
best?
– The one that maximizes
the margin.
Non-linear SVM
• Mapping the original feature space to some
higher-dimensional feature space where the
training set is separable
Φ: x → φ(x)
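The mapping idea can be made concrete with a toy example (not from the study): one-dimensional data whose class depends on |x| cannot be split by a single threshold on x, but after the invented mapping φ(x) = (x, x²) a threshold on the second coordinate separates the classes.

```python
# Sketch of the kernel idea: classes that are not linearly separable in
# the original space become separable after a mapping phi.  Class A sits
# near 0, class B far away, so no threshold on x works; after
# phi(x) = (x, x*x) the line x^2 = 1 separates them.

def phi(x):
    return (x, x * x)

class_a = [-0.5, 0.2, 0.4]      # |x| small
class_b = [-2.0, 1.5, 3.0]      # |x| large

separable = all(phi(x)[1] < 1 for x in class_a) and \
            all(phi(x)[1] > 1 for x in class_b)
print(separable)   # True: the mapped data is linearly separable
```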
Computational Model
• Each cable is represented by a vector (data point)
– The value of each data can be determined based on the
content/metadata of each document.
– We considered three data models based on different
values:
• Message body
• Subject + TAGS
• Subject only
• The category of each cable is modeled as the class
label.
• Each cable has one class label.
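A minimal sketch of turning a cable into a vector, assuming a simple binary bag-of-words model (the study's exact feature extraction is not specified here); the vocabulary, subject, and TAGS below are invented.

```python
# Sketch of one data model: a cable represented as a binary vector over a
# fixed vocabulary built from its subject and TAGS.  All values invented.

vocabulary = ["grain", "export", "license", "AORG", "TECH", "ETRD"]

def to_vector(subject, tags):
    """1 if the vocabulary term occurs in the subject or TAGS, else 0."""
    words = set(subject.lower().split()) | {t.lower() for t in tags}
    return [1 if term.lower() in words else 0 for term in vocabulary]

vec = to_vector("Grain export license request", ["ETRD", "AORG"])
print(vec)   # [1, 1, 1, 1, 0, 1]
```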
Implementation and Experiments
• SVM classifier is developed with libsvm Java
library
• In each test, two subsets of cables are randomly
selected for training and evaluation from the
1973/1974 collection.
• The classification result is measured by
– Accuracies: The overall percentage of correctly
classified cables in evaluation set.
– True positive rates: The percentage of cables that have
been correctly classified for a particular class.
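The two measures defined above can be sketched directly from their descriptions; the labels below are invented ("U" for unclassified, "S" for secret).

```python
# Sketch of the evaluation measures: overall accuracy and the per-class
# true positive rate.  Labels are invented.

def accuracy(actual, predicted):
    """Overall fraction of correctly classified documents."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def true_positive_rate(actual, predicted, cls):
    """Fraction of documents of class `cls` classified correctly."""
    relevant = [(a, p) for a, p in zip(actual, predicted) if a == cls]
    return sum(a == p for a, p in relevant) / len(relevant)

actual    = ["U", "U", "U", "S", "S"]
predicted = ["U", "U", "S", "S", "U"]
print(accuracy(actual, predicted))                 # 3 of 5 correct → 0.6
print(true_positive_rate(actual, predicted, "S"))  # 1 of 2 secret → 0.5
```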
Classification Results with Different Data Model
• Classification accuracies when training and evaluation data are
from the same time period (one year)
[Bar chart of classification accuracies for the three data models:
message body, subjects and TAGS, subjects only]
Word Distribution in the Message’s
Body
• Mean of total words per
document: 180
• Std deviation of total words
per document: 294
• Mean of total sentences per
document: 13
• Std deviation of sentences
per document: 22
• Mean of total unique words
per doc: 106
• Std deviation of total unique
words per doc: 119
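Statistics like those above can be computed per document with the standard library; the documents here are invented stand-ins for cable bodies.

```python
# Sketch of how the word-distribution figures can be computed: count
# words and unique words per document, then take mean and sample
# standard deviation.  The documents are invented.
from statistics import mean, stdev

docs = [
    "request grain export license for fiscal year",
    "embassy reports local press reaction",
    "secretary travel schedule update",
]

word_counts = [len(d.split()) for d in docs]
unique_counts = [len(set(d.split())) for d in docs]

print(word_counts)                  # [7, 5, 4]
print(round(mean(word_counts), 2))  # 5.33
print(round(stdev(word_counts), 2))
```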
Classification Results with Different Learning
Strategies
• Classification accuracies when training data is from 1973 and
evaluation data from 1974.
[Bar chart comparing accuracies for different vs. same time periods,
for the subjects-and-TAGS and subjects-only data models]
A “sliding window” test
• Sliding window: subset of cables within a 6-month period
• Classification results when training and evaluation data
are 1 month apart
[Line chart, Dec-72 through May-74: overall accuracy and true
positive rate for secret documents]
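The sliding-window setup can be sketched with plain month arithmetic. The exact windowing used in the study is assumed here: a 6-month training window paired with an evaluation window shifted by one month.

```python
# Sketch of the sliding-window test: train on a 6-month window of cables,
# evaluate on the window starting one month later.  The pairing scheme is
# an assumption based on the slide's description.

def month_range(start_year, start_month, length):
    """List of (year, month) pairs of the given length."""
    months = []
    y, m = start_year, start_month
    for _ in range(length):
        months.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

train = month_range(1973, 10, 6)   # Oct-73 .. Mar-74
test  = month_range(1973, 11, 6)   # Nov-73 .. Apr-74, one month later
print(train[0], train[-1])
print(test[0], test[-1])
```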
Findings and Hypothesis
• “Good” results
– Content of the message is not a strong indicator
of the security classification and results vary with
different feature selection methods
– TAGS may have the strongest correlation with the
security classification
• “Bad” results
– The correlation between security classification
and content may change over time.
A Closer Look at TAGS
• TAGS (Traffic Analysis by Geography and Subject)
• A controlled coding system issued by the State
Department to indicate places, organizations, and
themes; TAGS are carefully selected by department
officials when cables are created, stored, and distributed.
• There are about 14k unique tags in the 1973 and 1974
collection.
• A subset of TAGS is selected for association analysis.
• Analysis code is developed using Weka.
TAGS distribution
[Bar chart of document occurrence and total occurrence for top TAGS]
Collection Coverage of Top TAGS
Association Analysis
• Goal: Identify dependency rules that can
predict occurrences of one item based on the
co-occurrence of another item
• Tasks:
– Identify frequent combinations of TAGS
– Identify rules that associate a combination of
TAGS with a class label
– Compare rules generated for the different years
(1973 and 1974)
Association Rule Mining
• Association Rule: An expression of the form X → Y,
where X and Y are itemsets
– Example: {Tag 1, Tag 2} → {Tag 3}
• Association Rule Mining: Given a set of transactions T, find all
rules X → Y whose support and confidence exceed given thresholds
• Evaluation Metrics
– Support (s)
= Fraction of transactions that contain both X and Y
– Confidence (c)
= Measures how often Y appears in transactions that
contain X
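The two metrics can be computed directly from their definitions, treating each cable's set of TAGS as one transaction; the transactions below are invented.

```python
# Sketch of support and confidence over TAG "transactions" (one set of
# TAGS per cable).  The transactions are invented.

transactions = [
    {"AORG", "TECH", "IAEA"},
    {"AORG", "TECH"},
    {"AORG", "TECH", "IAEA"},
    {"ETRD"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """How often Y appears in transactions that contain X."""
    return support(x | y) / support(x)

print(support({"AORG", "TECH"}))               # 3 of 4 → 0.75
print(confidence({"AORG", "TECH"}, {"IAEA"}))  # 2 of those 3 → 0.667
```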
Exemplar Association Result
• The number of rules and the quality of the
rule are determined by the support level and
confidence level thresholds:
• An exemplar rule:
AORG=t TECH=t IAEA=t 168 ==>
DOCUMENT_CLASS=UNCLASSIFIED 167
conf:(0.99)
(168 cables carry all three TAGS; 167 of those are unclassified)
Results of Rules Generation
• We generated a list of rules for archivists to analyze.
• The ratio of rules generated for each class roughly
follows the same ratio of the number of cables in each
class.
– This is because the ratio between unclassified documents
and secret documents is 10:1
– Most rules generated are for unclassified documents
• To uncover more rules for secret documents the
support level has to be reduced.
Combination Matters
• An interesting case:
– MASS=t SA=t JO=t 80 ==> DOCUMENT_CLASS=SECRET 80 conf:(1)
– SA=t JO=t 117 ==> DOCUMENT_CLASS=SECRET 114 conf:(0.97)
– MASS=t JO=t 197 ==> DOCUMENT_CLASS=SECRET 190 conf:(0.96)
TAG   # of Secret  # of Unclassified  % Secret  % Unclassified
MASS  1078         281                0.79      0.21
JO    430          406                0.51      0.49
SA    586          371                0.61      0.39
YE    307          97                 0.76      0.24
TC    134          209                0.39      0.61
IR    410          533                0.43      0.57
MARR  1127         594                0.65      0.35
PFOR  3655         5633               0.39      0.61
Comparing Rules Generated From
Different Years
• We reduced the support level and the
confidence level thresholds to generate more
possible rules for each year
• We compared rules generated from one year
with the other to detect changes in:
– Tag combinations
– Inference about the security class they belong to
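The comparison step above can be sketched as a set intersection over rules keyed by their TAG combination. The rules below are invented, though they follow the pattern reported on the next slide (TU-related rules flipping toward SECRET in 1974).

```python
# Sketch of the year-over-year rule comparison: for TAG combinations that
# produced a rule in both years, flag those whose inferred security class
# changed.  The rules are invented examples.

rules_1973 = {frozenset(["TU"]): "UNCLASSIFIED",
              frozenset(["UR"]): "UNCLASSIFIED"}
rules_1974 = {frozenset(["TU"]): "SECRET",
              frozenset(["UR"]): "UNCLASSIFIED"}

changed = {tuple(sorted(tags)): (rules_1973[tags], rules_1974[tags])
           for tags in rules_1973.keys() & rules_1974.keys()
           if rules_1973[tags] != rules_1974[tags]}
print(changed)   # {('TU',): ('UNCLASSIFIED', 'SECRET')}
```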
Exemplar Comparison Results
1973 rules (TAGS → class; confidence, support):
• [OVIP] → UNCLASSIFIED (0.71, 2678)
• [UR] → UNCLASSIFIED (0.8, 3611)
• [GE, PARM] → SECRET (0.95, 94)
• [TU] → UNCLASSIFIED (0.73, 847)
1974 rules (TAGS → class; confidence, support):
• [OVIP, EG] → SECRET (0.8, 242)
• [OVIP, SY] → SECRET (0.77, 215)
• [UR, PARM] → SECRET (0.93, 494)
• [UR, NATO] → SECRET (0.89, 241)
• [GE] → UNCLASSIFIED (0.88, 3440)
• [CY, PFOR, TU] → SECRET (0.71, 734)
• [CY, GR, TU] → SECRET (0.74, 665)
• [GR, TU] → SECRET (0.71, 833)
• [NATO, TU] → SECRET (0.77, 235)
• [TU, PINT] → SECRET (0.72, 265)
• [CY, TU, PINT] → SECRET (0.85, 207)
• [CY, GR, PFOR, TU] → SECRET (0.72, 559)
Further compared rules (TAGS → class; confidence, support):
• [CY, TU] → SECRET (0.72, 913)
• [UR] → UNCLASSIFIED (0.79, 6047)
• [IR] → UNCLASSIFIED (0.78, 1959)
• [UR, PARM] → SECRET (0.9, 157)
• [PFOR, IR] → SECRET (0.75, 125)
Conclusions
• We show how classification and association analyses may
assist archival processing
• Our results indicate that classification is an effective
method to categorize records accurately
• However, an important assumption is consistency
between the training data and the test data;
accuracy drops when they diverge
• Association rule mining may be effective in identifying
keywords or trend changes in the document collection
• The computational process maps onto archival processing
as a non-linear analysis workflow
Acknowledgement
• TACC team
– Maria Esteva, data archivist.
– Jeffery Tang, undergraduate student
– Karthik Padmanabhan, graduate student
• NARA collaborator
– Mark Conrad
• Funding support
– NARA
– NSF
Weijia Xu
[email protected]
512-232-7158
For more information:
www.tacc.utexas.edu
A Case Study with the State Department
Cables Collection
• Introduction
– Problem definition
– Data description and preprocessing
• Applying classification algorithms
– Algorithm introduction
– Same time period
– Different time period
– Timeline study
• Applying association analysis
– Methodology introduction
– Association rule mining
– Rule comparisons
• Conclusions