FlashExtract: A General Framework for Data Extraction by Examples
Vu Le (UC Davis), Sumit Gulwani (MSR)

motivation

demo

schema extraction program
o Output schema
o Field extraction programs for all fields in the schema

output schema
o XML-like: sequence and structure
Seq([blue] Struct(Name: [green] String, City: [yellow] String))

field extraction program
o An ancestor
o A program in the DSL
Examples
o Green = <Blue, PRegion>
o Yellow = <, PSeqRegion>

data extraction DSL
o DSL is a tuple (G, N1, N2)
  o G: grammar defining extraction strategies
  o N1: top-level SeqRegion nonterminal
  o N2: top-level Region nonterminal
o Each non-terminal has a learn method

core algebra
o Decomposable Map Operator
o Filter Operators
o Merge Operator
o Pair Operator

city example
1. Filter lines that end with “WA”
2. Map each selected line to a pair of positions
3. Learn two leaf exprs for the two positions

learning algorithm
o Inductive on the grammar structure
o Learn city = learn a map operator
  o The lines that hold the city
  o The pair that identifies the city within a line
o Learn lines = learn a Boolean filter

inductive synthesis
1. Problem Definition: Identify a vertical domain of tasks that users struggle with
2. Domain-Specific Language (DSL): Design a DSL that can succinctly describe tasks in that domain
3. Synthesis Algorithm: Develop an algorithm that can efficiently translate examples into likely programs in DSL
4. Machine Learning: Rank the various programs
5. User Interface: Provide an appropriate interaction mechanism to resolve ambiguities

pros & cons
o Advantages
  o Efficient synthesizer
  o Easier ranking control
  o Tighter integration with user interaction model
o Disadvantages
  o Non-constructive: require thinking & implementation
  o Non-modular: DSL is not extensible

inductive meta-synthesis
o A synthesizer for a related family of DSLs that supports a common user interaction model
o Alleviate disadvantages of the generic methodology

inductive meta-synthesis
o Identify a family of vertical task domains
o Design an algebra for DSLs
o Implement a search algorithm for each algebra operator

extraction meta-synthesis
o Identify a family of vertical task domains
  o Extraction of semi-structured documents
o Design an algebra for DSLs
  o Merge, Map, FilterBool, FilterInt, Pair
o Implement a search algorithm for each algebra operator
  o Compositional and inductive learners

synthesis algorithm
o Top-down
  o Top-level SeqRegion, Region symbols N1, N2
o Grammar-guided
  o Grammar built from the algebra operators

key insight
o Reduce learning task for an expression to learning tasks for its sub-expressions
o Examples: Learn Map (λx : F, S)
  o Learn the scalar expression F
  o Learn the sequence expression S

instantiations
o Text files
o Web pages
o Spreadsheets

demo

evaluation
o Can FlashExtract extract data from real-world files?
o How many interactions typically required?
o How efficient/real-time is FlashExtract?
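
The city example and the core algebra from the earlier slides can be sketched in a few lines of Python. This is a minimal illustration, not FlashExtract's implementation: the function names (`filter_bool`, `map_op`, `pair`, `after_first`, `before_last`, `learn_ends_with`, `extract_cities`), the sample document, and the comma-based position logic are all assumptions made for this sketch. The toy learner only handles "line ends with a given token" filters, and the two position expressions are hard-coded rather than learned.

```python
from typing import Callable, List, Optional, Tuple

Region = Tuple[int, int]  # (start, end) character offsets within one line

# Hypothetical sample document; not taken from the talk.
SAMPLE = (
    "Alice Smith, Seattle, WA\n"
    "Bob Jones, Portland, OR\n"
    "Carol White, Spokane, WA"
)


def filter_bool(pred: Callable[[str], bool], seq: List[str]) -> List[str]:
    """Filter operator: keep the elements that satisfy a Boolean predicate."""
    return [x for x in seq if pred(x)]


def map_op(f: Callable[[str], Region], seq: List[str]) -> List[Region]:
    """Map operator: apply a scalar expression F to each element of S."""
    return [f(x) for x in seq]


def pair(start_of: Callable[[str], int],
         end_of: Callable[[str], int]) -> Callable[[str], Region]:
    """Pair operator: combine two position expressions into a region."""
    return lambda line: (start_of(line), end_of(line))


def after_first(token: str) -> Callable[[str], int]:
    """Leaf expression: position just after the first occurrence of token."""
    return lambda line: line.index(token) + len(token)


def before_last(token: str) -> Callable[[str], int]:
    """Leaf expression: position where the last occurrence of token starts."""
    return lambda line: line.rindex(token)


def learn_ends_with(example_lines: List[str]) -> Optional[Callable[[str], bool]]:
    """Toy filter learner: if every example line ends with the same
    whitespace-separated token, return an 'ends with that token' predicate."""
    last_tokens = {line.split()[-1] for line in example_lines if line.split()}
    if len(last_tokens) != 1:
        return None  # no consistent filter in this tiny hypothesis space
    token = last_tokens.pop()
    return lambda line: line.endswith(token)


def extract_cities(text: str, example_lines: List[str]) -> List[str]:
    """Learn the line filter from examples, then run Map(Pair(...), Filter(...))."""
    keep = learn_ends_with(example_lines)
    if keep is None:
        raise ValueError("no consistent line filter for the given examples")
    lines = filter_bool(keep, text.splitlines())                            # step 1
    regions = map_op(pair(after_first(", "), before_last(", ")), lines)     # steps 2-3
    return [line[s:e] for line, (s, e) in zip(lines, regions)]


print(extract_cities(SAMPLE, ["Alice Smith, Seattle, WA"]))
# prints ['Seattle', 'Spokane']
```

A single example line is enough here because the hypothesis space is tiny. A richer DSL would admit many consistent programs, which is where ranking and further user interaction come in.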

benchmarks
o 25 text files
  o System log files
  o Copied texts from web pages and PDFs
  o Samples from “Pro Perl Parsing”
o 25 webpages from [1]
  o Add two more test cases for each web page
o 25 spreadsheets
  o 7 from [2] that are applicable for extracting
  o 18 from EUSES corpus

[1] E. Oro, M. Ruffolo, and S. Staab. SXPath: extending XPath towards spatial querying on web documents. Proc. VLDB Endow., 2010.
[2] B. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.

effectiveness
o Can FlashExtract extract data from real-world files? Yes
o How many interactions typically required? 2.36 examples

efficiency
o How efficient/real-time is FlashExtract? 0.82s last interaction

conclusion
o Inductive meta-synthesis
o FlashExtract is general
  o Text file, web page, spreadsheet instantiations
o FlashExtract is practical
  o Extract real-world data, in real time, within a few examples

thank you
Questions?