Revitalizing Data in Historical Documents

Challenge

Dusty documents, old books, microfilmed paper, government archival documents: can we make the valuable information in them machine-readable? Can we capture the data contained in their tables, records, and page geometry? Can we organize, index, and search geometrically laid-out data? Extracting hand-recorded lists of names, illnesses, and causes of death would help research into inherited diseases. Extracting hand-recorded lists of emigration certificates, land purchases, and employment records would help research the past. Extracting printed pages of archival scientific experiments would give scientists access to vast stores of old data. The challenges are: turn paper that presents repetitive facts in a human-readable geometric page layout into computer-readable data, extract record groupings from the spatial coordinates of the raw data, index geometrically organized data, and integrate information from multiple sources.

Vision

We apply research and technology from conceptual-model knowledge-bases to address these challenges. We limit ourselves to extracting knowledge recorded on paper in forms, tables, and page-laid-out data. We leverage coordinatized raw data to find, combine, and interpret ``higher'' units of data embedded in paper documents in geometric patterns. Once these units are found, the conceptual-model knowledge-base provides the theory to organize, index, search, and retrieve them.

Research

We assume that (a) machine-readable raw characters, (b) their locations on a page, and (c) the locations of separating lines are provided, either by an optical character recognizer (for printed pages) or by a human extractor (for handwritten pages). Raw horizontal and vertical lines are extractable by pattern-matching and image-processing techniques. We use the line data and the raw character layout to infer patterns and fields of data.
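The step of using line data and raw character layout to infer fields can be sketched concretely. The following is a minimal illustration, not the project's implementation: it assumes a hypothetical `Char` record (a recognized glyph plus its page coordinates) and the x-positions of extracted vertical separator lines, and buckets characters into (row, column) cells.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical input type: one recognized character and its coordinates.
@dataclass
class Char:
    glyph: str
    x: float   # horizontal position on the page
    y: float   # vertical position (row baseline)

def group_into_fields(chars, v_lines, row_tol=5.0):
    """Bucket raw characters into (row, column) cells.

    Columns are the gaps between the sorted vertical separator lines;
    rows are clusters of baselines quantized by row_tol. A sketch only:
    real pages need skew correction and smarter row clustering.
    """
    v_lines = sorted(v_lines)
    cells = defaultdict(list)
    for c in sorted(chars, key=lambda c: (c.y, c.x)):
        col = bisect_right(v_lines, c.x)   # column index from line gaps
        row = round(c.y / row_tol)         # crude baseline clustering
        cells[(row, col)].append(c.glyph)
    return {key: "".join(glyphs) for key, glyphs in cells.items()}
```

For example, five characters on one baseline with a single vertical line at x=50 yield two fields, one on each side of the line:

```python
chars = [Char("J", 10, 12), Char("o", 14, 12), Char("e", 18, 12),
         Char("4", 60, 12), Char("2", 64, 12)]
group_into_fields(chars, v_lines=[50])   # two cells: "Joe" and "42"
```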
So, how do we build conceptual-model knowledge-bases to achieve this objective? How do we exploit lines, relative positions, and raw characters to infer logical connectivity? How do we recognize table headers and automatically infer factored attributes? How do we unravel nested structures? How can we determine the boundary between groups of information? How do we knit local fields of data into a globally coherent picture? How do we index ``layout and embedded raw characters'' to support human-issued queries?
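One of these questions, determining the boundary between groups of information, admits a simple geometric heuristic worth illustrating: record groups in hand-kept lists are often separated by larger-than-usual vertical whitespace. The sketch below is an assumption-laden illustration, not the project's method; it splits a sorted list of row baselines wherever the gap exceeds a multiple of the median gap.

```python
from statistics import median

def segment_records(row_ys, gap_factor=1.8):
    """Split sorted row baselines into record groups at large gaps.

    A gap counts as a group boundary when it exceeds gap_factor times
    the median inter-row gap. Heuristic sketch only: real documents
    may need layout cues (rules, indentation) as well as whitespace.
    """
    if len(row_ys) < 2:
        return [list(row_ys)]
    gaps = [b - a for a, b in zip(row_ys, row_ys[1:])]
    typical = median(gaps)
    groups, current = [], [row_ys[0]]
    for y, gap in zip(row_ys[1:], gaps):
        if gap > gap_factor * typical:
            groups.append(current)
            current = []
        current.append(y)
    groups.append(current)
    return groups
```

On baselines [10, 20, 30, 80, 90, 100] the median gap is 10, so the jump of 50 between 30 and 80 is taken as a record boundary, giving two groups of three rows each.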