Revitalizing Data in Historical Documents
Challenge
Dusty documents, old books, microfilmed paper, government archival documents: can we make some of the valuable information in them machine-readable? Can we capture the data contained in their tables, records, and page geometry? Can we organize, index, and search geometrically laid-out data?

Extracting hand-recorded lists of names, sicknesses, and causes of death would help research into inherited diseases. Extracting hand-recorded lists of emigration certificates, land purchases, and employment records would help research the past. Extracting printed pages of archival scientific experiments would give scientists access to vast (old) stores of data.
The challenges are: to turn paper containing repetitive facts laid out in a human-readable geometric page layout into computer-readable data, to extract record groupings from the spatial coordinates of the raw data, to index geometrically based data, and to integrate information from multiple sources.
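As a rough illustration of the record-grouping part of this challenge, the following Python sketch clusters raw text boxes into candidate records using only their page coordinates. The Box structure, its field names, and the gap threshold are assumptions made for the example, not part of the project.

```python
# Hypothetical sketch: group text boxes into row-level records by vertical position.
from dataclasses import dataclass

@dataclass
class Box:
    text: str   # raw characters recognized in this region
    x: float    # left edge on the page
    y: float    # top edge on the page
    h: float    # height of the region

def group_into_records(boxes: list[Box], gap_factor: float = 0.5) -> list[list[Box]]:
    """Group boxes whose vertical positions are close into candidate records."""
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    records: list[list[Box]] = [[ordered[0]]]
    for box in ordered[1:]:
        prev = records[-1][-1]
        # Start a new record when the vertical gap exceeds a fraction of line height.
        if box.y - prev.y > gap_factor * prev.h:
            records.append([box])
        else:
            records[-1].append(box)
    return records

if __name__ == "__main__":
    sample = [
        Box("John Smith", 10, 100, 12), Box("consumption", 180, 101, 12),
        Box("Mary Jones", 10, 120, 12), Box("fever", 180, 121, 12),
    ]
    for record in group_into_records(sample):
        print(" | ".join(b.text for b in record))
```

Real page layouts would need more than a vertical-gap heuristic (multi-line cells, skew, marginal notes), but the sketch shows the kind of spatial reasoning the challenge requires.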
Vision
We apply research and technology from conceptual-model knowledge-bases to address these challenges. We limit ourselves to extracting knowledge recorded on paper in forms, tables, and page-laid-out data. We leverage coordinatized raw data to find, combine, and interpret "higher" units of data embedded in geometric patterns in paper documents. Once found, the conceptual-model knowledge-base provides the theory to organize, index, search, and retrieve these "higher" units of inferred data.
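To make the idea of a "higher" unit concrete, here is a hedged sketch of how one inferred record might be represented once mapped into a conceptual-model knowledge-base. The Person/DeathRecord schema, the sample values, and the provenance fields are illustrative assumptions, not the project's actual model.

```python
# Hypothetical sketch: one "higher" unit of inferred data, with provenance
# linking it back to the coordinatized raw characters on the scanned page.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    name: str

@dataclass
class DeathRecord:
    person: Person
    cause: str
    year: Optional[int]
    source_page: int                                   # scanned page the record came from
    bounding_box: tuple[float, float, float, float]    # where on the page it was found

# The conceptual model ties low-level coordinatized text back to typed, queryable facts.
record = DeathRecord(Person("John Smith"), cause="consumption", year=1847,
                     source_page=12, bounding_box=(10.0, 100.0, 260.0, 112.0))
print(record)
```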
Research
We assume that (a) machine-readable raw characters, (b) their locations on a page, and (c) the locations of separating lines are provided either by an optical character recognizer (for printed pages) or by a human extractor (for handwritten pages). Raw horizontal and vertical lines are extractable by pattern-matching and image-processing techniques. We use the line data and the raw character layout to infer patterns and fields of data.
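The following sketch illustrates the assumed inputs and one possible inference step: given character positions and the coordinates of separating lines, each character is assigned to the table cell bounded by the surrounding ruling lines. The data structures and function names are assumptions for illustration, not the project's implementation.

```python
# Hypothetical sketch: assign raw characters to table cells using ruling-line coordinates.
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Char:
    ch: str
    x: float
    y: float

def cell_of(c: Char, v_lines: list[float], h_lines: list[float]) -> tuple[int, int]:
    """Return the (row, column) of the cell containing a character.

    v_lines / h_lines are sorted x- and y-coordinates of vertical and horizontal
    separating lines, as produced by an OCR engine or a human extractor.
    """
    col = bisect_right(v_lines, c.x)
    row = bisect_right(h_lines, c.y)
    return row, col

def cells(chars: list[Char], v_lines: list[float], h_lines: list[float]) -> dict:
    """Group raw characters into cell contents, reading left to right."""
    grouped: dict[tuple[int, int], list[Char]] = {}
    for c in chars:
        grouped.setdefault(cell_of(c, v_lines, h_lines), []).append(c)
    return {k: "".join(ch.ch for ch in sorted(v, key=lambda c: c.x))
            for k, v in grouped.items()}

if __name__ == "__main__":
    chars = [Char("J", 12, 105), Char("o", 18, 105), Char("1847", 205, 105)]
    print(cells(chars, v_lines=[0, 200, 400], h_lines=[0, 100, 130]))
    # -> {(2, 1): 'Jo', (2, 2): '1847'}
```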
So, how do we build conceptual-model knowledge-bases to achieve this objective? How do we exploit
lines, relative positions, and raw characters to infer logical connectivity? How do we recognize table
headers and automatically infer factored attributes? How do we unravel nested structures? How can we
determine the boundary between groups of information? How do we knit local fields of data into a globally
coherent picture? How do we index "layout and embedded raw characters" to support human-issued
queries?
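As one hedged illustration of the last question, the sketch below indexes inferred records by attribute and token so that simple human-issued queries (for example, all records whose cause-of-death field mentions a given term) return page locations. The index structure and query form are assumptions made for the example, not a description of the project's design.

```python
# Hypothetical sketch: an inverted index over inferred record fields.
from collections import defaultdict

class LayoutIndex:
    def __init__(self) -> None:
        # (attribute, token) -> list of (page, row) locations of matching records
        self._postings: dict[tuple[str, str], list[tuple[int, int]]] = defaultdict(list)

    def add_record(self, page: int, row: int, fields: dict[str, str]) -> None:
        """Index every field of an inferred record under its attribute name."""
        for attribute, value in fields.items():
            for token in value.lower().split():
                self._postings[(attribute, token)].append((page, row))

    def query(self, attribute: str, token: str) -> list[tuple[int, int]]:
        """Return page/row locations whose `attribute` field contains `token`."""
        return self._postings.get((attribute, token.lower()), [])

index = LayoutIndex()
index.add_record(page=12, row=3, fields={"name": "John Smith", "cause": "consumption"})
print(index.query("cause", "consumption"))   # -> [(12, 3)]
```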