Document Engineering
Robin Burke
ECT 360

Outline
  Admin
  Quiz + Answers
  Document Engineering
  In-Class Exercise

Admin
  Project Milestone #2
    identify domain for project
    supposed to be due last week
      • no submission link
    will be due today
  Project Milestone #3
    document analysis
    due 10/10

Quiz
  30 minutes

Document Engineering
  Glushko and McGrath coined this term
    "a new discipline for specifying, designing and implementing the documents that serve as the interfaces to business processes."
  Topic much larger than XML
    XML provides a mechanism for the results of such an engineering activity

Central insight
  The concept of "document" is very stable and central to many business processes
  XML technologies allow systems to consume and produce documents

Tasks in Document Engineering
  1. Analyzing the Context
       what is the problem to be solved
  2. Analyzing/Design Business Processes
       examining the boundaries of processes and seeing what documents go in and out
  3. Analyzing/Design Business Transactions
       express processes at the level where we can identify documents as input and output
  4. Analyzing Documents & Components
       collecting documents and analyzing their contents
       extracting components
  5. Assembling Components & Models
       put components together into data models and document models
  6. Implementation
       writing XML schemas
       writing code that accepts, manipulates and outputs XML

In this class
  Mostly interested in #6
    defining XML languages
    writing code
  But languages must come from somewhere
    process and content analysis to derive requirements

For your project
  You will select a domain in which you can find existing documents
    assume that the first three steps are complete
      • you know that these are the important documents to represent
  You will try to figure out what about these documents needs to be represented
    document analysis

Document Analysis
  1. Collect representative documents
  2. Examine documents
  3. Identify information-bearing components
  4. Identify their role in the relevant business process
  5. Name them
  6. Type them

Components
  Any piece of information that
    has a unique label or identifier is a candidate component
    is self-contained and comprehensible on its own is a candidate component
  A component is a logical unit, with no presentation implied
    may be organized structurally

Components
  Just because information is presented as a unit doesn't mean it is one component
    Example
      • "Robert J. Glushko and Tim McGrath"
  Just because information is not presented together doesn't mean the components should be separate
    Example
      • DePaul University
      • School of CTI
      • 243 S. Wabash Ave.

Hints for Components
  Spatial features of documents
    whitespace
    rules
    boxes
    layout patterns
  Typography
    font sizes and styles
      • not always
  Proximity
    figures and captions
  Structure
    be careful!
    document may not have the right structure
      • better to pull out internal information components
      • see if the structure emerges from the analysis

What to record
  Tentative name
    must be tentative; names change frequently
  Type of data

Example
  http://www.nytimes.com/2005/10/03/theater/newsandfeatures/03wilson.html?pagewanted=1&th&emc=th

Example 2
  http://www.internest.com/brittanyhomes/brittanyhomes4340.asp?Print=on

Example 3
  http://www.irs.gov/pub/irs-pdf/f1040ez.pdf

Component Harvest
  For each document
    extract components
  Do so independently of other documents
    lets you identify differences in representation and contents

Harvesting
  http://www.internest.com/brittanyhomes/brittanyhomes4340.asp?Print=on
  http://www.internest.com/rivercity/rivercity12226.asp?Print=on

Component Consolidation
  Examine different sets of harvested components
  Look for similarities and differences
  Try to resolve differences
    • Renaming
    • Structural reorganization
  Develop detailed type information
    value standardization

Standardizing Values
  Assists in writing schema/DTD
  Assists in document processing
  BUT
    value space not likely to remain constant
    too many choices doesn't help
      • 180 countries but do you do business with all of them?
    too few choices is also a problem
      • if the distinctions important to your process can't be captured

Naming
  Names are critical
    they communicate what each part of the document is for
  Names are the most dynamic part of the analysis
    expect them to change several times
    useful to have a dictionary nearby
  In consolidation
    we need to merge synonyms
    come up with new names for homonyms
      • usually best to rename all homonyms

Example
  title (in a lecture series)
    Is it the title of the talk?
    The job title of the speaker?
    The name of the lecture series?
  longer names needed to specify
    Talk Title
    Series Title
    Job Title

Exercise
  Each group to get liner notes documents
  Produce harvest tables
  Produce consolidated table
  Switch documents
    see how they fit

Next week
  Schemas
  Next project milestone
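
Sketch: harvest and consolidation
  The harvest and consolidation steps described in these slides could be prototyped roughly as below. This is a minimal sketch, not a prescribed method. It assumes each harvested component is recorded as a tentative name plus a data type, as suggested under "What to record". The two lecture-announcement documents, their component names, and the synonym and homonym tables are hypothetical. Only the "title" homonym and the longer names "Talk Title" and "Job Title" come from the naming example.

```python
# Hypothetical harvest tables, one per document, built independently.
# Each entry is (tentative name, data type).
harvests = {
    "doc_a": [("title", "string"), ("speaker", "string"), ("date", "date")],
    "doc_b": [("title", "string"), ("presenter", "string"), ("date", "date")],
}

# Assumed synonym table: different names for the same component.
SYNONYMS = {"presenter": "speaker"}

# Assumed homonym table: the same name means different things in
# different documents, so every occurrence gets a longer name.
HOMONYMS = {
    ("doc_a", "title"): "Talk Title",
    ("doc_b", "title"): "Job Title",
}


def consolidate(harvests):
    """Merge per-document harvest tables into one consolidated table.

    Returns {consolidated name: {"type": data type, "sources": [doc ids]}}.
    Raises ValueError when two documents disagree on a component's type,
    since that difference has to be resolved by hand.
    """
    table = {}
    for doc_id, components in harvests.items():
        for name, dtype in components:
            name = HOMONYMS.get((doc_id, name), name)  # rename homonyms
            name = SYNONYMS.get(name, name)            # merge synonyms
            entry = table.setdefault(name, {"type": dtype, "sources": []})
            if entry["type"] != dtype:
                raise ValueError(f"type conflict for {name!r}")
            entry["sources"].append(doc_id)
    return table


consolidated = consolidate(harvests)
for name, entry in consolidated.items():
    print(name, entry["type"], entry["sources"])
```

  Keeping the synonym and homonym tables separate from the harvest data reflects the point that names are tentative: when a name changes, only the tables need editing and the consolidation can simply be rerun.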