Hands 2010 - Paper Abstract - The University of Sheffield

Answering Medieval Authorship Questions using e-Science
Michael Meredith & Peter Ainsworth, The University of Sheffield, UK
Overview
The provenance and authorship of medieval manuscripts have long been a common research question across
multiple disciplines of the arts and humanities. Key questions include: where and by whom were these
manuscripts created, and what does the codicological evidence (scribal hands, catchwords, page layouts,
artistic styles in the miniatures and marginal decoration) suggest about book production in this period? This
paper focuses on 15th-century manuscripts (Froissart’s Chronicles) that were made during the Hundred Years’
War and shows how e-Science is allowing us to explore and aid the humanist in identifying the characteristic
stylistic, orthographic and iconographic ‘signatures’ of particular scribes and artists, through the use of image
recognition and data mining techniques. The application of computer-science algorithms employed in this
research includes image edge detection, polygonal model fitting and geometric comparisons. Additionally, we
discuss how these technologies are applicable to other datasets, assisting scholars to dig into image data to
determine authorship-related questions.
Methodology
The research described in this paper is being undertaken as part of the larger Digging into Image Data
collaboration between an international, multi-disciplinary team of researchers from the University of Sheffield,
UK, Michigan State University, USA, and the University of Illinois at Urbana-Champaign, USA. The
collaboration has been assembled to investigate the topic of authorship across a diverse set of image data:
15th-century Froissart manuscripts, 17th- and 18th-century digitised maps, and 19th- and 20th-century digitised quilts.
One of the primary questions we are attempting to answer is whether adaptive image analytics can attribute
authorship and, if so, how accurate and computationally scalable they are when applied to diverse collections of
image data.
The first step in answering this question is for computer scientists to gather an understanding of what tell-tale
signatures scribes leave by working closely with humanists; this first step isn’t necessarily one-sided as software
algorithms can transform images into different domains and spectrums that might assist the scholar better to
describe characteristic features. Armed with this understanding, research into image analytics is undertaken to
extract these discriminating features in order to construct a statistical digital signature of the author. The results
from the research undertaken by the computer scientists are examined by the domain experts and analytically
compared using samples of data with known provenance. Future work within the overall project will use the
outputs from these stages and, using high-performance computing, search across a large collection of images and
datasets to cluster together images that appear to be of similar authorship. The humanists and art historians will
comment on these findings.
Scholarly Interpretation of Medieval Manuscripts
The manuscripts considered within the project consist of text written in a reasonably well-defined medieval
cursive hand. In order to construct the particular digital signature of a scribe we focus on a number of areas from
which we can extrapolate meaningful patterns. It is important to note that we do not expect any individual
characteristic area to be sufficient for identifying a particular scribe; it is only when several are considered
together that we can draw meaning from the results. Some of these areas do, however, raise further research
questions that, together, the scholar and computer scientist should be able to answer.
Use of Abbreviations
Figure 1 illustrates how abbreviations can be used to indicate differences between scribes by juxtaposing folios
from two different manuscripts, each written by a different scribe copying similar texts (see footnote 1). For
example, the scribe on the right consistently abbreviates ‘et’, except when it immediately follows a punctuation
mark. However, before we can use abbreviations as a potential identifier, further questions need to be answered:
1) Were both scribes given the same physical space to write the text (are the lengths of text comparable)?
2) Do scribes consistently use abbreviations across their work?
3) Is there a pattern to the use of abbreviations that perhaps the scribe has developed?
Footnote 1: Although we have aligned the text within the two manuscripts according to their accepted equivalences
(i.e. corresponding sections), and indeed their texts correlate very closely, we acknowledge that manuscript
stemmas show that it is unlikely that the two scribes responsible for authoring these folios used exactly the
same source.
The e-Science research outlined later in this paper demonstrates how we can answer these questions, further
feeding into the larger and encompassing question of authorship.
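Once occurrences of a word such as ‘et’ have been located and labelled, questions 2 and 3 above reduce to comparing abbreviation rates across contexts. The following is a minimal Python sketch of that comparison, not the project's implementation; the context labels (`after_punctuation`, `mid_sentence`) and the sample observations are hypothetical.

```python
from collections import Counter


def abbreviation_rates(observations):
    """Return the fraction of abbreviated forms for each context.

    `observations` is an iterable of (context, is_abbreviated) pairs,
    one pair per occurrence of the word being studied.
    """
    totals = Counter()
    abbreviated = Counter()
    for context, is_abbreviated in observations:
        totals[context] += 1
        if is_abbreviated:
            abbreviated[context] += 1
    return {context: abbreviated[context] / totals[context] for context in totals}


# Hypothetical labelled occurrences of 'et' for one scribe.
sample = [
    ("after_punctuation", False),
    ("after_punctuation", False),
    ("mid_sentence", True),
    ("mid_sentence", True),
    ("mid_sentence", False),
    ("mid_sentence", True),
]
rates = abbreviation_rates(sample)
```

A rate close to 0 in one context and close to 1 in another would suggest a habit of the kind described for the right-hand scribe, whereas similar rates across contexts would suggest that the choice is driven by other factors, such as available space.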
Flow of Text
Figure 2 demonstrates how some scribes are better at filling the available space; the scribe in the right-hand side
of the image better justifies the text across the column. There are also examples where the scribe slowly curves
the text the further he/she gets down the page – i.e. the rows are no longer perfectly horizontal, even once
digitisation shears are removed; scribes wrote the text on a flat surface. In order to validate these observations
for inclusion within a digital signature, we must first determine whether a scribe is consistently neat or whether
other factors influence this, such as experience and turn-around times. The algorithms used across this research
project will allow us to harvest quickly a large volume of data to analyse, on which to base further
conjecture.
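One possible way to put numbers on these two observations is sketched below in Python. It is an illustration under stated assumptions rather than the method used in the project: it assumes that baseline points and line-end positions have already been extracted from the image as pixel coordinates, and the function names are hypothetical.

```python
import statistics


def baseline_slope(points):
    """Least-squares slope of a straight line through (x, y) baseline points.

    A slope near zero means the row of text is horizontal; a slope that
    grows row by row down the page would indicate drifting lines.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("baseline points must not all share the same x")
    return sxy / sxx


def right_margin_spread(line_end_xs):
    """Population standard deviation of the x positions where rows end.

    A small value means the scribe fills the column evenly.
    """
    return statistics.pstdev(line_end_xs)


# Hypothetical measurements in pixels.
slope = baseline_slope([(0, 10), (10, 11), (20, 12)])
neat = right_margin_spread([100, 100, 100])
ragged = right_margin_spread([98, 100, 102])
```

Collecting these two values for every row on every folio attributed to a known scribe would show whether neatness is stable enough to contribute to a digital signature.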
Individual Letters and Words
Within a body of text, scholars can quickly point to forms that appear to be characteristic of a particular scribe.
Figure 3 illustrates how two different scribes finish the tail of their ‘g’s in different directions. Other potential
tell-tale indicators between different scribes include ‘y’s, ‘est’s and ‘estre’s.
The use of abbreviations, flow of text and letter forms represent only a subset of the differences that we are
investigating. Other areas include the ductus of the text and pen flourishes, both potential contributing
factors to a scribe’s signature.
Computer-aided Research Application
The potential contributing indicators of a scribe’s digital signature can be addressed using algorithms based in
the geometric pattern matching domain. Applying an edge detection map (based on the Sobel convolution) to
the source image, we fit a polygonal model around the text for geometric comparisons. Two different ways of
fitting a suitably refined polygonal model to this data are evaluated: Least Squares and statistical EM
(Expectation-Maximisation). The extracted geometry provides the basis for further data analysis techniques
such as Principal Component Analysis (to help answer questions regarding the overall “impression” of the text)
and geometric-based image retrieval techniques (i.e. shape comparisons), where significant contour similarities
help to mitigate a degree of noise in the edge map and polygonal model fitting.
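The edge detection step can be illustrated with a minimal pure-Python sketch of the Sobel operator. It is not the project's code: the function names, the list-of-rows greyscale image representation and the threshold value are assumptions made for illustration, and the polygonal model fitting that follows is not shown.

```python
import math

# 3x3 Sobel kernels for the horizontal (GX) and vertical (GY) gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude at each interior pixel.

    `image` is a list of equal-length rows of grey levels.
    Border pixels are left at 0.0 because the kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    pixel = image[y + dy][x + dx]
                    gx += GX[dy + 1][dx + 1] * pixel
                    gy += GY[dy + 1][dx + 1] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


def edge_map(image, threshold):
    """Binary edge map: 1 where the gradient magnitude exceeds `threshold`."""
    return [[1 if m > threshold else 0 for m in row]
            for row in sobel_magnitude(image)]


# Hypothetical 5x5 patch: dark parchment on the left, bright on the right.
patch = [[0, 0, 0, 255, 255] for _ in range(5)]
edges = edge_map(patch, threshold=500)
```

In the resulting map the pixels on either side of the vertical intensity step are marked as edges; it is contours of this kind, traced around pen strokes, to which a polygonal model would then be fitted.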
We further couple the geometric analysis approach with work arising from the Online Froissart research project
(see footnote 2).
As part of the Online Froissart project, many of the manuscript images in our dataset have been analysed and
synchronised to a non-diplomatic transcription which allows both the image and text to be zoomed, panned and
manipulated together. The synchronisation of text with image offers the ability to perform a textual search on
patterns of characters (such as 'est') and find every occurrence between any two points of a manuscript with line
accuracy on the image itself. We can also estimate how far along the line such occurrences are before we need
to apply any further image analysis and identification.
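The search described above can be sketched as follows. This is a hedged illustration only: the `(folio, line number, text)` tuple layout, the function name and the sample lines are hypothetical and do not reflect the Online Froissart data format. The position along the line is estimated simply as the character offset divided by the line length.

```python
def find_occurrences(lines, pattern):
    """Locate every occurrence of `pattern` in a line-synchronised transcription.

    `lines` is an iterable of (folio, line_number, text) tuples.
    Returns a list of (folio, line_number, fraction) tuples, where `fraction`
    is a rough estimate (0.0 to 1.0) of how far along the line the match starts.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    hits = []
    for folio, line_number, text in lines:
        start = text.find(pattern)
        while start != -1:
            hits.append((folio, line_number, start / len(text)))
            start = text.find(pattern, start + 1)
    return hits


# Hypothetical transcription lines; note that 'est' also matches inside 'estre'.
transcription = [
    ("f 1v", 1, "il est venu"),
    ("f 1v", 2, "estre et est"),
]
hits = find_occurrences(transcription, "est")
```

Each hit identifies a line in the image and an approximate horizontal position, which narrows the region that later image analysis has to examine. Because character widths vary in a handwritten line, the fraction is only a starting estimate.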
By combining these technologies we automatically count, crop and highlight specific sections from the large
corpus of image data we have in order to draw further comparisons and conclusions. In particular regions that
we know were written by a specific scribe, the data we harvest is used to determine the validity of the suggested
indicators. When they are consistent, future work will fit probability models to them for larger-scale data
mining. The overall approach also lends itself well to identifying general patterns of words, abbreviations,
flourishes, etc, within our dataset for which we have no supporting transcription synced to the image, or in the
case of abbreviations, no indicators within the transcription as to where they occur.
Future Work
The research outlined above provides only one part of the Digging into Image Data research project. Our partners
are similarly working on complementary algorithms to identify authorship across the map and quilt collections.
The algorithms that are researched at each site will be cross-pollinated and applied to all the datasets (geometric
shape recognition to identify reoccurring patterns on a quilt, for example), which will help us extract more robust
salient characteristics in relation to determining authorship. This will be assisted with the inclusion of
machine-learning techniques. The collective algorithms will then be applied across a large corpus of data using
high-performance computing and the results analysed and reported by our research team.
Footnote 2: See http://www.hrionline.ac.uk/onlinefroissart
Illustrations and Figures
[Figure images not reproduced. Folio labels: MS 1 f 1v; MS 864 f 1v; MS 864 f 97v; MS 865 f 90v. Marker: Start of equivalent sections.]
Figure 1: Use of abbreviations between two different manuscripts/scribal hands
Annotations on abbreviations:
Corresponding uses of ‘et’ (abbreviated in the right-hand image).
The right-hand scribe sometimes writes ‘et’ in full rather than abbreviating it. Is there a pattern to when he/she
uses ‘et’ rather than the abbreviated form? ‘et’ appears to be written in full when it follows a “comma” or
occurs very early in a sentence.
Further examples of abbreviated forms by the scribe in the right-hand image (many more examples are visible
on the whole folio).
Figure 2: Flow of text
Figure 3: Example of different letter ‘g’s between scribal hands (panels: Scribe ‘B’s letter G; Scribe ‘C’s letter G)