FEISGILTT 2012, October 16-17, Seattle, Washington

Making Data Mining of XLIFF Artefacts Relevant for the On-going Development of the XLIFF Standard
Asanka Wasala, David Filip, Chris Exton and Reinhard Schäler

Outline
• Introduction
• Background
• Related work
• Methodology
• Data collection
• Preprocessing
• Construction of the XLIFF corpus
• XLIFF data mining and construction of the database
• Analysis
• Results
• Discussion
• How to Make Data Mining of XLIFF Artefacts Relevant for the Ongoing Development of the XLIFF Standard?

Introduction
• “Interoperability is the ability of two or more systems or components to exchange information and to use the information that has been exchanged” (IEEE, 1991)
• Data exchange formats play a prominent role in facilitating interoperability
• Standardization of data exchange formats:
  • Essential for successful interoperability
  • Difficult due to the constantly evolving nature of technologies, businesses, processes and tools
• Standards need to be constantly reviewed and updated

Introduction
• Framework to evaluate data exchange standard usage
  • Provides empirical evidence and statistics related to the actual usage of different elements, attributes, element values and attribute values
  • Uses the empirical evidence to inform the development and maintenance of standards
• Main experiment: “Addressing Interoperability issues in Localisation Processes”
  • How to identify the limitations of data exchange standards and implementations that are leading to interoperability issues?
  • What elements, attributes and their values are leading to interoperability issues?
  • What are the most important elements, attributes, and element and attribute values for interoperability?
Introduction
• XLIFF

Background: Localisation
• XLIFF support in commonly used tools (Bly 2010)
  • Matrix containing tools and their level of support for individual XLIFF elements
• Open-source CAT tool named “Virtaal” (Morado-Vázquez and Wolff 2011)
  • Compares its level of XLIFF support with the matrix presented by Bly (2010)
  • Weakness in Bly’s (2010) analysis: it does not take into account the relative importance of different parts of the XLIFF specification
  • Simplification of XLIFF attributes: "approved" and "state"
• XLIFF support in CAT tools, XLIFF P&L Sub-Committee (Morado-Vázquez and Filip 2012)
  • Tracks quarterly changes in XLIFF support in major CAT tools
• Limitations of XLIFF (Imhof 2010; Anastasiou and Morado-Vázquez 2010)
  • XLIFF’s extensibility, segmentation and inline elements, complexity

Background: Localisation
• XLIFF↔LCX comparison (Wasala et al. 2010)
  • Improvements to XLIFF and LCX
  • Interoperability issues associated with the XLIFF standard
• LocConnect: SOA L10n interoperability testing framework (Wasala et al. 2011)
  • XLIFF as the messaging format
• LIVE DEMO: 17th Oct., 10:15 – 11:10 (Federated Track)
  • CMSL10n ↔ SOLAS integration as an ITS 2.0 ↔ XLIFF test bed

Background
• Most previous studies attempted to identify gaps and deficits of implementations and standards using top-down approaches (tool analysis, conceptual frameworks)
• Most of the previous studies present issues; only a few present solutions
• Standard compliance, conformance and interoperability of tools and technologies claiming to support standards have not yet been adequately addressed
• We propose an analytical framework that provides empirical evidence and statistics related to the actual usage of different elements, attributes and attribute values of a standard
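The kind of usage statistics such a framework collects can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sample XLIFF string and the names `local_name` and `count_usage` are assumptions made for illustration, and only the standard library is used.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; it is not taken from the corpus.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="tbd" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Hello</source><target state="new">Hallo</target></trans-unit>
      <trans-unit id="2"><source>Bye</source></trans-unit>
    </body>
  </file>
</xliff>"""


def local_name(qname):
    """Strip the '{namespace}' prefix that ElementTree adds, if present."""
    return qname.rsplit("}", 1)[-1]


def count_usage(xml_text):
    """Return counters for elements, attributes and attribute values."""
    elements, attributes, values = Counter(), Counter(), Counter()
    for node in ET.fromstring(xml_text).iter():
        tag = local_name(node.tag)
        elements[tag] += 1
        for name, value in node.attrib.items():
            attr = local_name(name)
            attributes[(tag, attr)] += 1
            values[(tag, attr, value)] += 1
    return elements, attributes, values


elements, attributes, values = count_usage(SAMPLE)
print(elements.most_common())
```

Summing such counters over every file in a corpus would give raw usage figures per element, attribute and attribute value, which is the sort of evidence the proposed framework relies on.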
Methodology
1) Construction of a corpus
2) Construction of a repository
3) Data profiling (designing usage-analysis metrics)
4) Analysis of the results

Methodology: Data Collection
• Centre for Next Generation Localisation (CNGL) industrial partners
  • 3 companies
• Crawling
  • Google + Python scripts
  • Crawled on two occasions: 26th and 29th August 2011
  • FileType:xlf, xliff, xml+xliff +”trans-unit” +body
• The XLIFF corpus is most likely not representative of all the XLIFF files used in the real world

Methodology: Cleaning & Preprocessing (crawled files)
• Cleaning of the file names, e.g. dialog.xlf-spec=svn8-r=8 (Python script)
• Removal of non-XLIFF files
  • 1st pass: Python script (regex matching)
  • 2nd pass: manual analysis of files
• Encoding conversion: ASCII/UTF-32/UTF-16/BE/LE → UTF-8 (without BOM) (Python script)
• Removal of XML directive and DOCTYPE declarations
  • During the manual analysis
• Extraction of embedded XLIFF content
  • During the manual analysis
• Removal of duplicated files (by analysing content)
  • Using a freely available tool (Auslogics Duplicate File Finder)

Methodology: The 1st XLIFF Corpus

Source       XLIFF files
Company A    38
Company B    29
Company C    1004
Crawled      444
Eclipse      1664
Total        3179

• File sizes & content vary
• Sizes reported: 8.26 MB, 16.8 MB

Methodology
[Diagram: corpus, validation, tables (tags, attributes, children), database (~1 GB)]
• Python scripts to validate, extract and populate tables with data

Methodology: XLIFF Data Mining
• SQL queries run against the database built from the corpus:
  • select tags from db order by frequency desc
  • select attributes from db where value!=“”
  • select children, tag from db where source like “company A”

Analysis: Usage Metrics and Research Questions
• How to identify the syntactic conformance issues of the standard?
  • Identify validation errors
  • Degree of syntactic conformance to the specification, common validation errors
• How to simplify the standard?
  • Identify least frequently used and never used features
  • Features that can be removed
• How to identify the most influential features of the standard?
  • Identify most commonly used features across organizations
  • Features that would have the widest effects in case of a change
• How to identify the candidate features that can be introduced to the standard?
  • Identify frequently added extensions and custom features
  • Frequently/most commonly used custom features that might be standardized in future; different extensions in use from which new features can be adopted
• How to identify the best usage practices of the standard?
  • Feature usage patterns
  • To identify elegant and consistent solutions to recurring problems in the usage of the standard, and to promote best-usage practices of the standard to improve tool interoperability
• How to identify where organizations deviate from the norm w.r.t. usage of the standard?
  • Identify frequently used features of individual organizations (not employed by others)
  • Allows assessment and evaluation of organizations' individualistic standard usage practices
• How to resolve semantic ambiguities of features of the standard?
  • Attribute values and element values
  • Helps to identify features leading to semantic conflicts

How to identify the syntactic conformance issues of the standard?

Results: Validation Errors
[Figure: Overall Validation Results. Number of files (not validated, invalid, valid, strict valid) by XLIFF version (xliff 1.0, xliff 1.1, xliff 1.2, undefined)]
[Figure: Validation Errors (Invalid, Transitional, Strict)]

Results: Validation Errors
• Element '{urn:oasis:names:tc:xliff:document:1.2}file', attribute 'target-language': 'tbd' is not a valid value of the atomic type 'xs:language'.
• Element '{urn:oasis:names:tc:xliff:document:1.2}trans-unit': Duplicate key-sequence ['Export'] in key identity-constraint '{urn:oasis:names:tc:xliff:document:1.2}K_unit_id'.
• Element '{urn:oasis:names:tc:xliff:document:1.2}trans-unit': The attribute 'id' is required but missing.
• Element '{urn:oasis:names:tc:xliff:document:1.2}file', attribute 'tool': The attribute 'tool' is not allowed.
• Element '{urn:oasis:names:tc:xliff:document:1.2}file', attribute '{okapi-framework:xliffextensions}inputEncoding': No matching global attribute declaration available, but demanded by the strict wildcard.
• Element '{urn:oasis:names:tc:xliff:document:1.2}group', attribute '{http://www.gs4tr.org/schema/xliff-ext}segmented': No matching global attribute declaration available, but demanded by the strict wildcard.

How to simplify the standard?

Results: Least Frequently Used and Never Used Features
• Relative usage of element X = (number of times X appeared in the corpus / average use of an element/attribute in the corpus) × 100
• Least frequently used elements/attributes: relative usage < 1%
• Least frequently used ≠ not important (e.g. the XLIFF version attribute)
• Weight of elements/attributes: content, structure, presentation

Results: Least Frequently Used and Never Used Features
• sub (0.48), seg-source (0.70)
• prop-group (0.25), prop (0.47)
• reference (0.00), internal-file (0.11), external-file (0.18)
• skl (0.16)
• ex (0.07), bx (0.07), it (0.28), g (0.93)
• bin-target (0.01), bin-source (0.09), bin-unit (0.09)
[Legend: used in 1 source/organization; used in 2 or more sources/organizations]

Results: Least Frequently Used and Never Used Features
[Figure: least frequently used and never used features; label: Parent]

How to identify the most influential features of the standard?

Results: Commonly Used Features Across Organizations
[Figure: commonly used features of Company 1 and Company 2; labels: bin-unit, internal-file, glossary, x, ph, header, xliff, phase, skl, seg-source, bx]

Results: Commonly Used Features Across Organizations
• Elements: <xliff>, <file>, <body>, <header>, <trans-unit>, <source>, <target>, <external-file>, <group>, <ph>, <alt-trans>, <note>
• Attributes (grouped as on the slide):
  • version
  • original, source-language, target-language, tool, build-num, product-name, product-version
  • id, approved, translate, resname, restype
  • xml:space
  • state, xml:lang
  • href
  • resname, restype
  • id
  • from
[Legend: 5 sources/organizations; more than 3 sources/organizations]

How to identify the candidate features that can be introduced to the standard?
• Identify frequently added extensions and custom features
• Frequently/most commonly used custom features that might be standardized in future; different extensions in use from which new features can be adopted

Results: Frequently Added Extensions
• http://www.gs4tr.org/schema/xliff
• http://www.idiominc.com/ws/asset
• http://www.w3.org/1999/xhtml
• http://www.sap.com
• urn:xmarker
• http://cmf.zope.org/namespaces/default/
• http://www.gdf.com/xmlns/gdf-xstr.xsd
• urn:ektron:xliff
• http://sdl.com/FileTypes/SdlXliff/1.0
• http://www.tektronix.com
• http://www.ontram.de/XLIFF-Sup-V1
• http://www.w3.org/2001/XMLSchema
• http://www.crossmediasolutions.de/ rtwsk-extensions

How to identify the best usage practices of the standard?

Results: Feature Usage Patterns
[Figure: feature usage patterns (Company B, Company C, Eclipse)]

How to identify where organizations deviate from the norm w.r.t. usage of the standard?

Results: Frequently Used Features of Individual Organizations
[Figures: frequently used features of individual organizations (two slides)]

How to resolve semantic ambiguities of features of the standard?

Results: Attribute-Values and Element-Values
• source-language, target-language and xml:lang: en, EN, en-US, en-us, EN-US, en_US, ENGLISH, x-dev, unknown, uz-UZ-Cyrl, tbd
• date: 04/02/2009 23:24:18, 11/06/2008, 2001-04-01T05:30:02, 2006-11-24, 2007-01-01, 2010-03-16T21:58:27Z
• match-quality: 100, 100%, 78.46, fuzzy, String, Guaranteed
• state: final, needs-review, needs-review-l10n, needs-reviewtranslation, needs-translation, new, signedoff, translated, updated, user:translated, x-reviewed

Results: Other Possibilities

Results: Tools Involved
[Figure: tool names found in the corpus, labels as extracted: Snap-On Ireland, CATFile_Translation_UtilityNAIL.LUI 1.6.0.21409, Idiom WorldServer 9.0.5, AgdaVS export tool, Benten, Ektron, IGD-2-XLIFF, ITS Translate Decorator, Maxprograms JavaPM, Okapi.Utilities.Set01, Pleiades, Swordfish III, blancoNLpackGenerator, genrb, Idiom WorldServer 9.2.0, LKR, PASSOLO 3.0, TM-ABC, Rainbow v2.00]

Results
• Use of different file extensions (e.g. .xlf, .xliff, .xml)
• Use of different encodings (e.g. utf-8, utf-8 with BOM, utf-16)
• Inconsistent use of DOCTYPE declarations and XML declarations
• Never used attributes: alttranstype, annotates, assoc, clone, comment, extype
• Never used predefined attribute values, e.g. lisp for the datatype attribute
• Use of improper syntax for user-defined attribute values (i.e. not using the 'x-' prefix), e.g. 'text' for the 'datatype' attribute
• Use of extreme values, e.g. extremely lengthy strings for IDs, spaces within IDs
• Use of improper formats
  • Language (i.e. not as specified in BCP 47/RFC 5646)
  • Date (i.e. not as specified in ISO 8601)
• Omission of required attributes, e.g. the version attribute of the <xliff> element
• Use of custom values instead of the predefined values, e.g. the use of pofile instead of the predefined po value for the datatype attribute
• ...

Discussion: XLIFF Data Mining Framework
[Figure: framework stages: crawling, preprocessing, data mining, data analysis]

Discussion
• First large XLIFF corpus and a novel empirical framework
  • Can analyse the use of the specification and focus on the aspects of the standard that are evidently more important or less important to the actual stakeholders of the standard
  • Quick reference method to check the status of actual implementation of dubious elements, attributes, attribute values and usage patterns
• First framework that employs a systematic bottom-up approach for identifying important criteria for the standardization process
• Can be applied to similar XML-based file formats in other domains to improve important aspects of interoperability

Discussion: Low External Validity
• The XLIFF corpus is most likely not representative of all the XLIFF files used in the real world

Discussion
• How to Make Data Mining of XLIFF Artifacts Relevant for the Ongoing Development of the XLIFF Standard?

Discussion
• Imagine that you could run the previously shown queries on a representative corpus!
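The queries shown earlier (for example, select tags from db order by frequency desc) can be tried out against a toy database. The following is a minimal sketch using Python's built-in sqlite3 module. The slides do not give the real schema, so the table name `occurrences`, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per attribute occurrence found in the corpus.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrences (source TEXT, tag TEXT, attribute TEXT, attr_value TEXT)"
)
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?, ?, ?)",
    [
        ("company A", "trans-unit", "id", "1"),
        ("company A", "trans-unit", "id", "2"),
        ("company A", "file", "target-language", "tbd"),
        ("company B", "trans-unit", "id", "1"),
        ("crawled", "note", "from", ""),
    ],
)

# Tags ordered by how often they occur (ties broken alphabetically).
tag_frequency = conn.execute(
    "SELECT tag, COUNT(*) AS frequency FROM occurrences "
    "GROUP BY tag ORDER BY frequency DESC, tag"
).fetchall()

# Attributes that carry at least one non-empty value.
non_empty = conn.execute(
    "SELECT DISTINCT attribute FROM occurrences "
    "WHERE attr_value != '' ORDER BY attribute"
).fetchall()

# Tags used by one source, mirroring the 'where source like ...' query.
company_a_tags = conn.execute(
    "SELECT DISTINCT tag FROM occurrences WHERE source LIKE ? ORDER BY tag",
    ("company A",),
).fetchall()

print(tag_frequency)
print(non_empty)
print(company_a_tags)
```

With a representative corpus loaded into such tables, the same three query shapes would answer which features are most used, which attributes actually carry values, and what a single organization relies on.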
Discussion
Change SotA Report methodology to include sample files for the corpus
• The XLIFF P&L SC is currently finalizing work on the 2nd edition of the SotA report
  • XLIFF SotA Report 2nd Ed DRAFT
• At the same time we’re kicking off preparations for the 3rd and 4th editions
  • 3rd edition will add XLIFF 2.0 to the mix
  • 4th edition depends on designing and approving the process of creation of the SC-warranted corpus for empirical evaluation of XLIFF feature support

Discussion
• We want to hear from you if you are willing to contribute to the creation of the corpus

Discussion: Considerations for making the corpus
• Proportion
• Confidentiality
• Process stage
• Well-formedness
• Etc.

Thank you!
This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at the Localisation Research Centre (Department of Computer Science and Information Systems), University of Limerick, Limerick, Ireland. Thanks to Dr. Ian O'Keeffe and Dr. Jim Buckley for all the guidance, suggestions and feedback. We would like to thank the CNGL industrial partner organisations that contributed to the XLIFF corpus.