BiographyNet
Project review, year 1
September 18th, 2013
eScience Center

BiographyNet Review Meeting, eScience centre, September 18th, 2013

Agenda
• Project objectives and first year results (Piek)
• Methodology and historian perspective (Serge)
• Model, conversions and interface (Niels)
• NLP tools and research (Antske)
• Discussion

Starting point
• http://www.biografischportaal.nl
• Academic discipline of writing histories:
  – computational tools marginally used,
  – long scholarly tradition of study by reading,
  – single-authored historical narratives,
  – while more and more historical sources are digitally available.
• Project challenges:
  – “Computational thinking in history”:
    • Narrative historians are not used to framing research problems in computational terms, while computer-science researchers understand little of the subtleties of historical analysis.
    • Strong multi-disciplinary cooperation of front runners in both fields and demonstrator development to achieve common understanding.
  – Methodological and tool support

Contribution to historical research
• New research on Dutch nation building and a revaluation of biographical information.
  – Bridging a gap between life histories, qualitative historical research, and quantitative historical research.
• Open research on less static objects and relations such as events:
  – the most important pieces of information, capturing the changes and processes that matter.
• Capture historiographic perspective:
  – Requires a model that takes different framings of the same event into account.
  – Adds to the who-knows-who: when, where and how did the lives of people cross; how did they affect each other’s lives and the world they lived in?
  – How do and did we conceive historic events, and how are different narratives created around the same history?
Expected outcome
• Demonstrator on top of the Biography Portal.
  – Cyclic development.
  – Links within the Biography Portal among the various (textual and visual) datasets.
• Open-source release of the e-science platform for analyzing biographical texts about people.
  – Adherence to all relevant Web standards and APIs, maximizing reusability.
• Proposal for a methodology for extraction of a network of relations between people and (historic) events.

Short term goals
1. Building a richer data repository by connecting different distributed sources of data through formalized links and metadata.
2. Detection of (co-referenced) named entities (persons, places and dates) and events.
3. Harmonizing the texts, which vary from 19th-century Dutch to contemporary Dutch and whose OCR-ed versions also contain errors.
4. Development of visualization, analytic tools, and computational historiographical methods on the structured data generated in 1 through 3.

Results first year
• Methodology:
  – Use cases and the anticipation of data- and process-driven biases
  – Formal modeling of provenance
  – Sustainability, replication, reproducibility
• Software:
  – Design of interfaces and analytic tools
  – Text mining and evaluation
  – Linked Data conversion scripts
• Data:
  – Linked Data version of the Portal
  – Linking to Agora
  – Discussions with Wikimedia/Wikipedia/DBpedia & Bibliotheek.nl
  – Verrijkt Koninkrijk
  – Huygens ING exploitation to extend the Portal with enriched data produced
• 6 accepted papers

BiographyNet and historical approaches to ‘big’ and heterogeneous data

The historian’s role
1. Methodology: work on a methodology to extract information, relationships and events from short biographical texts.
2. Question the data: develop use cases.
3. Contribute to the design of a user interface that challenges historians to dig deeper into the data.
4. Sensitize target user groups (historians) to both the possibilities and the limitations of computational methods in historical research.

1: Methodology
• Year 1, historian’s focus: how reliable and representative are the texts from this particular dataset? Which questions can and cannot be answered? How well do ‘tools’ perform compared to a ‘real’ historian? See also publications (below).
• Year 1, interdisciplinary focus: what is the provenance of the information, how is it manipulated in order to arrive at the answer to a query, and who is responsible for the tools that manipulate those data?

2: Use Cases
• 12 cases developed, ranging from ‘simple’ to ‘highly complex’
• Simple: group analysis of Governors-General of the Dutch Indies
• More complex: when did Dutch elites get involved with the ‘New World’?
• Complex: what can we say about nationalism in biographical dictionaries from the nineteenth and twentieth century?

Governors-General of the Dutch Indies
• Highest official in the Dutch Indies, 1610-1949
• 71 men
• What can we say about these men as a group?
• Who was appointed and what qualities did he have to have?
• Etc …

3: User friendly interface
• Mainly work in progress
  – Discussion about the impact of a ‘design metaphor’ (like “time line” …, “house of…”, “building blocks for…”, “family tree…”) on the type of questions raised by the user
• … presentation Niels.
[Figures: The House of History; Time line; Family Tree]

4: Sensitize target user groups
• Publication in Tijdschrift voor Biografie (reaching the nearest target user group of the demonstrator): Serge ter Braake, ‘Het individu en zijn tijdgenoten. Wat een biograaf kan doen met prosopografie en biografische woordenboeken’, Tijdschrift voor Biografie 2 (summer 2013) vol. 2, 52-61.
• ‘Biography and Computational Methods’, joint paper in preparation by Ter Braake, Ockeloen and Fokkens (to be submitted before the end of the month to Journal for Historical Biography)
• Research on nationalism and national biographies, to be published in 2014
• Presentation at Huygens ING, 10 October 2013 (for circa 50 professional historians)
• Presentation on provenance at KNAW Digital Humanities Workshop, 14-15 November 2013
• Introduction to e-Humanities in the current curriculum of BA1 students at the Vrije Universiteit (what is eHumanities, how does one use a source like the Oxford Dictionary of National Biography?)
• Design and development of a series of electives and a minor on e-history and e-humanities (BA 2-3; starting 2014/2015). The BiographyNet dataset will be used in a lab for history bachelor students.

BiographyNet
Towards the demonstrator

Overview
Main components of the demonstrator
• Schema to structure the data
• Conversion of the BP to Linked Data
• NLP system setup
• Interface

A crash course on Linked Data
Online machine-readable data with links
• Simple facts called ‘RDF Triples’
    Thorbecke > hasBirthPlace > Zwolle
Some technology concepts:
• Schemas: to structure LD
• RDF Stores: to store LD
• SPARQL: to access LD
Huge growth in the past years:
• More than 300 data sources
• More than 30 billion triples

The conversion process
Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future

The conversion process
Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
  – Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
• Linking to other sources
  – Essential step in the ‘Linked Data’ philosophy

The conversion process
Data schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
• Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.

BiographyNet: Schema illustration
http://www.biographynet.nl/schema

Provenance: What is it?
Provenance information is information on how Entities come into existence
• What are entities?
  – Documents, Articles, Pictures, etc.
  – Basically anything that can be ‘produced’ by something or someone
• What kind of information?
  – Who did what?
  – Using which entities?
  – In which processes?

Provenance in BiographyNet
For the demonstrator, provenance needs to be modeled:
• From several perspectives:
  – Information involved
  – Processes involved
  – People involved
• At multiple levels:
  – An aggregated level, i.e. per enrichment
  – A detailed level, i.e. all individual processes

Why is provenance info important for BiographyNet?
Needed to ensure credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool
• Historians need to be able to validate results
  – Replication: retrieving the same results later using the demonstrator
  – Reproducibility: manually by the historian
• The aggregated level: targeted at the historian
  – Which original sources were involved?
  – Who to contact in case results are called into question?
• The detailed level: targeted at the computer scientist
  – Detailed information on each individual step
  – Allows for debugging the internal processing pipeline

BiographyNet: Enrichment example
[Diagram: the NNBW biographical description of “Thorbecke” (with provenance metadata, person metadata and biography parts) carries the event Birth 1798. An NLP tool reads the biography text “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duit…” (“Johan Rudolph Thorbecke was born in 1798, on 14 January in Zwolle, and comes from a half-Ger…”) and produces an enrichment: a biographical description whose person metadata carry the event Birth 1798-01-14, Zwolle.]

More than just Provenance…
P-PLAN is not only used to model what actually happened, but also what was supposed to happen
• ‘Plans’ describe the original idea behind an activity
  – Describe what should happen in a certain activity
  – Each ‘Plan’ corresponds with an ‘Activity’
• ‘Variables’ describe the input/output of an activity
  – Structure, format, quantity, etc.
  – Each ‘Variable’ corresponds with an input/output ‘Entity’ of an ‘Activity’
• ‘Plans’ have their own provenance info
  – E.g. who was responsible for the creation of a plan?

Why model plans besides provenance?
The benefits of modeling plans:
• Forces the recording of what an activity and its input/output should look like
• Provides information on the original idea behind an activity
  – As such, can provide info on possible assumptions and biases
• Allows for comparing the actual activity and its input/output with the original plan and its variables
  – Do they differ from each other, and to what extent?
• Makes finding errors much easier, as more information is available about what the input/output should look like

BiographyNet: Schema illustration
[Schema diagram with nodes: Variable, Plan, Agent (Person), Entity, Association, Agent (NLP Tool), Activity]

Recap / Current Status
Main components of the demonstrator
• Initial schema available (publication LISC @ISWC 2013)
  – Schema models enrichments and aggregations alongside original sources
  – Allows for storing various levels of provenance information
  – Model will be adapted while progressing with building the demonstrator
• Initial conversion to Linked Data available
  – Structure according to schema presented
  – Next step is linking to external sources
• NLP system setup available (Antske)
• Interface
  – Presentation of general outline and ideas

Interface: Focus
• The interface should be easy to use
• The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
• The interface should allow users to ‘fine tune’ results returned upon an initial action

Interface: Options
• Query composition
• Faceted browsing
• A combination

Interface: Query composition
• Drop-down boxes to select ‘Verbs’, data elements and relations

Interface: Faceted browsing
• No explicit querying, but convergence of the data through browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier adjustment of the selected data

Interface: A combination
• Query composition combined with faceted browsing
• Create new facets by defining a query
  – The result of the query is available as a subset of the data by selecting the defined facet
  – As such, combinable with other facets
• Method to integrate ‘open’ querying of the data into a general interface and visualization
[Diagram labels: Facets, Selection, Process, Data, Question, Analysis, Results]

Interface: Demonstrator
[Diagram labels: Results, ?]
Time and place are primary elements

BiographyNet
Text Mining

First year goals for Text Mining
• Methodology
  – Requirements
  – Approach
• Basic System for data enrichment in text
  – Identify metadata in text
  – Setup that can easily be improved and extended
  – (co-referenced) named entities, events
  – Deal with alternative spelling

Methodology requirements
• Reproducing results in Natural Language Processing is non-trivial
• Details in implementations or experimental setup can influence results up to a point where they tell a different story

Reproducing results
• Example: performance of WordNet similarity scores compared to human ranking [figure]

Reproducing results
• Clear registration of all steps involved and storage of (intermediate) system output can improve reproducibility
• Systematic testing can help to gain insight into the variation in the outcome of our systems and hence lead to more insight into their performance
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.

Methodology requirements
• The method used to extract information may introduce a bias that has unintended influence on the outcome of the historian’s questions
• For example: location identification with GeoNames
  – Heuristic: when multiple locations have the same name, take the one in or closest to the Netherlands
  – High precision, but consider ‘America’ and ‘Willemstad’, names that exist both in the Netherlands and overseas: what if the historian investigates trips to the Netherlands by officials overseas?
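
The closest-to-the-Netherlands heuristic above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Candidate` type, the `resolve` function, the reference point `NL_CENTER` and the approximate coordinates are all assumptions made for the example, and no GeoNames service is called. The sketch returns the rejected candidates alongside the chosen one, so the bias introduced by the heuristic stays visible to the historian.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

# Hypothetical reference point, roughly the middle of the Netherlands.
NL_CENTER = (52.1, 5.3)


@dataclass(frozen=True)
class Candidate:
    """One possible referent of a place name (illustrative, not a GeoNames record)."""
    name: str
    region: str
    lat: float
    lon: float


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def resolve(candidates):
    """Pick the candidate nearest to NL_CENTER; candidates must be non-empty.

    Returns (chosen, rejected) so the discarded readings remain available.
    """
    ranked = sorted(candidates, key=lambda c: haversine_km(NL_CENTER, (c.lat, c.lon)))
    return ranked[0], ranked[1:]


# Approximate coordinates, for illustration only.
willemstad = [
    Candidate("Willemstad", "Curaçao", 12.11, -68.93),
    Candidate("Willemstad", "Noord-Brabant", 51.69, 4.44),
]
chosen, rejected = resolve(willemstad)
print(chosen.region)                  # Noord-Brabant
print([c.region for c in rejected])   # ['Curaçao']
```

For a biography of an official stationed overseas, the rejected reading may well be the correct one, which is why the sketch keeps it instead of discarding it.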
Methodology requirements
• Maximize reuse of existing tools for BiographyNet
• Maximize reuse of tools developed within BiographyNet by other researchers
• How can we create a setup that facilitates this?

Methodology approach
• Provenance modeling:
  – Can help to improve reproducibility of research
  – Can support systematic testing
  – Can model the exact steps taken
• Flexible formats that support this:
  – NLP Annotation Format (NAF) to manage output and input of NLP tools
  – Grounded Annotation Framework (GAF) for the final output of the NLP pipeline

NLP Annotation Format
• Sustainable, because close to existing linguistic formats (e.g. LAF, GRAF, NIF)
• Joint work across projects and with other institutes (notably University of the Basque Country, Fondazione Bruno Kessler)
• Flexible, because the output of individual tools is added in separate layers

Grounded Annotation Framework
• RDF-compliant framework
• Introduces the denotedBy relation that links mentions in text to formal representations of their instances
• Provenance is marked using Named Graphs
• This allows us to accumulate information from different sources and represent alternative perspectives

Provenance Modeling
• It must be clear where information comes from (original source, opinion holder, automatically retrieved or from metadata)
• For NLP research:
  – Model each step of the process
  – Resources used (preprocessing + version), system output
• For historic research:
  – What may introduce biases? How can the process be represented in an understandable manner?
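
The named-graph idea from the Grounded Annotation Framework slide can be illustrated with plain Python tuples. This is only a sketch under stated assumptions: the prefixes `bio:`, `ex:` and `graph:`, the `provenance` dictionary and the `perspectives` function are hypothetical names chosen for the example, not the project's actual URIs or schema. The values reuse the Thorbecke enrichment example (metadata says 1798, the NLP tool says 1798-01-14 in Zwolle).

```python
from collections import defaultdict

# Each statement is a quad: (subject, predicate, object, graph).
# All identifiers are illustrative, not the project's actual URIs.
quads = [
    ("bio:thorbecke", "ex:birthDate", "1798", "graph:nnbw-metadata"),
    ("bio:thorbecke", "ex:birthDate", "1798-01-14", "graph:nlp-enrichment"),
    ("bio:thorbecke", "ex:birthPlace", "Zwolle", "graph:nlp-enrichment"),
]

# Provenance is attached to the named graph, not to each individual triple.
provenance = {
    "graph:nnbw-metadata": {"source": "NNBW metadata", "produced_by": "original data"},
    "graph:nlp-enrichment": {"source": "NNBW biography text", "produced_by": "NLP tool"},
}


def perspectives(subject, predicate):
    """Group the values claimed for (subject, predicate) by the graph that claims them."""
    found = defaultdict(list)
    for s, p, o, g in quads:
        if (s, p) == (subject, predicate):
            found[g].append(o)
    return dict(found)


for graph, values in perspectives("bio:thorbecke", "ex:birthDate").items():
    print(values, "according to", provenance[graph]["produced_by"])
# ['1798'] according to original data
# ['1798-01-14'] according to NLP tool
```

Because both birth dates are kept, each in its own graph, the metadata value and the NLP-derived value can coexist, and a historian can see which source or tool stands behind each claim.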
Basic System
• Identifying metadata in text
  – Linguistically naïve supervised machine learning
• Linguistic processing:
  – Named Entity recognition (time and location)
  – Concept identification

First Evaluation
• Use case: Governors-General of the Dutch Indies
• 129 biographies describing 71 individuals
• Serge ter Braake extracted information manually

Metadata versus text mining
[Bar chart: metadata versus text; vertical axis 0 to 100]

Preliminary outcome of text mining
Columns: Category, Correct, Incorrect, Both, Correct text, Incorrect text
Education: 2, 0, 2
Father: 0, 0, 9, 2
Mother: 0, 1, 2, 5
Occupation: 14, 6, 21, 4
Birthdate: 21, 2, 35, 9, 2
Recall problems (for birthdate):
1. Sentence not found (35): typical for Wikipedia, bwn, vdaa
2. Value not found (7)
3. Wrong sentence (1), wrong date (1): date of marriage, date of death

Observations
• Recall problems (for birthdate):
  – Sentence identification
• Easy ways to improve:
  – Parents: named entity recognition
  – Occupation, Education: concept-tagged corpus
  – Source-specific training
• More difficult problems:
  – Relations, functions of other people
  – Negations or factuality (e.g. refused positions for occupations)

NLP outlook
• Evaluation:
  – Text-based annotations
• Metadata extraction:
  – Supervised with linguistically rich features
  – Rule-based approaches
• Beyond Metadata:
  – Time lines of people’s lives (2nd year)
  – Networks between people (2nd year)
  – Complex event modeling (3rd year)

Questions?
http://www.biographynet.nl/