Course Project Ideas
Yanlei Diao
University of Massachusetts Amherst

New Directions for DB Research
• Sensor data: new architecture
• XML: new data model
• Streams: new execution model
• Data quality and lineage: new services
• …

Yanlei Diao, University of Massachusetts Amherst 5/22/2017

Querying in Sensor Networks
• Store data locally at sensors and push queries into the sensor network
  – Flash memory energy efficiency.
  – Limited capabilities of sensor platforms.
[Figure labels: Internet, Gateway, Push query to sensors, Flash Memory, Acoustic stream, Image stream]

Optimize for Flash and Limited RAM Memory
• Flash Memory Constraints
  – Data cannot be over-written, only erased
  – Pages can often only be erased in blocks (16-64 KB)
  – Unlike magnetic disks, cannot modify in-place
• Challenges:
  – Energy: Organize data on flash to minimize read/write/erase operations
  – Memory: Minimize use of memory for flash database.
[Figure: 1. Load block into memory; 2. Modify in-memory; 3. Save block back. Size labels: ~4-10 KB and erase block ~16-64 KB]

StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
Proxy: Cache of Image Summaries
• Identify “best” sensors to forward query.
• Provide hints to reduce search complexity at sensor.

StonesDB: System Operation
[Figure labels: Query Engine, Partitioned Access Methods]

Research Issues in StonesDB
• Local Database Layer
  – Reduce updates for indexing and aging.
  – New cost models for self-tuning sensor databases.
  – Energy-optimized query processing.
  – Query processing over aged data.
• Distributed Database Layer
  – What summaries are relevant to queries?
  – What remainder queries to send to sensors?
  – What resolution of summaries to cache?

XML (Extensible Markup Language)
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML: a tagging mechanism to describe content.

XML Data Model (Graph)
[Figure: XML data graph rooted at db (#0), with two book nodes (b1, b2) and one publisher node (mkp). Edge labels: title, author, pub, name, state. Leaf pcdata values (#1-#7): Complete..., Chamberlin, Principles..., Bernstein, Newcomer, Morgan..., CA]
• Main structure: ordered, labeled tree
• References between nodes: becoming a graph

XQuery: XML Query Language
• A declarative language for querying XML data
• XPath: path expressions
  – Patterns to be matched against an XML graph
  – /bib/paper[author/lastname='Croft']/title
• FLWOR expressions
  – Combining matching and restructuring of XML data
  – for $p in distinct(document("bib.xml")//publisher)
    let $b := document("bib.xml")/book[publisher = $p]
    where count($b) > 100
    order by $p/name
    return $p

Metadata Management using XML
• File systems for large-scale scientific simulations
  – File systems: petabytes or even more
  – Directory tree (metadata): large, can’t fit in memory
  – Links between files: steps in a simulation, data derivation
• File Searches
  – all the files generated on Oct 1, 2005
  – all the files whose name is like '*simu*.txt'
  – all the files that were generated from the file 'basic-measures.txt'
• Build an XML store to manage directory trees!
  – XML data model
  – XML Query language
  – XML Indices

XML Document Processing
• Multi-hierarchical XML markup of text documents
  – Multi-hierarchies: part-of-speech, page-line
  – Features in different hierarchies overlap in scope
  – Need a query language & querying mechanism
  – References [Nakov et al., 2005; Iacob & Dekhtyar, 2005]
• Querying and ranking of XML data
  – XML fragments returned as results
  – Fuzzy matches
  – Ranking of matches
  – References [Amer-Yahia et al., 2005; Luo et al., 2003]
• Well-defined problems identify your contributions!
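The three file searches listed under Metadata Management using XML can be sketched with Python's standard-library ElementTree, whose limited XPath subset supports attribute predicates. This is only an illustration under stated assumptions: the `dir`/`file` element names and the `name`, `created`, and `derivedFrom` attributes are hypothetical and do not come from the slides, and a real store for petabyte-scale file systems would not hold the tree in memory.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical directory-tree document. Element and attribute names are
# illustrative assumptions, not a schema taken from the slides.
TREE_XML = """
<dir name="root">
  <dir name="run1">
    <file name="basic-measures.txt" created="2005-09-30"/>
    <file name="simu-a.txt" created="2005-10-01"
          derivedFrom="basic-measures.txt"/>
  </dir>
  <dir name="run2">
    <file name="post-simu-b.txt" created="2005-10-01"
          derivedFrom="simu-a.txt"/>
    <file name="notes.md" created="2005-10-02"/>
  </dir>
</dir>
"""

root = ET.fromstring(TREE_XML)


def files_created_on(tree_root, date):
    """All files generated on a given date (attribute predicate in XPath)."""
    path = f".//file[@created='{date}']"
    return [f.get("name") for f in tree_root.iterfind(path)]


def files_named_like(tree_root, pattern):
    """All files whose name matches a glob pattern such as '*simu*.txt'."""
    return [
        f.get("name")
        for f in tree_root.iter("file")
        if fnmatch.fnmatchcase(f.get("name"), pattern)
    ]


def files_derived_from(tree_root, source_name):
    """All files directly generated from the named source file."""
    path = f".//file[@derivedFrom='{source_name}']"
    return [f.get("name") for f in tree_root.iterfind(path)]


print(files_created_on(root, "2005-10-01"))
print(files_named_like(root, "*simu*.txt"))
print(files_derived_from(root, "basic-measures.txt"))
```

The derivation search here follows only direct links; following a chain of derivations (steps in a simulation) would need a traversal over the `derivedFrom` references, which is where the tree turns into a graph.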
Data Stream Management
Traditional Database               Data Stream Processor
• Data at rest                     • Data in motion, unending
• One-shot or periodic queries     • Continuous, long-running queries
• Query-driven execution           • Data-driven execution
[Figure labels: Query, Data (Attr1, Attr2, Attr3), Results; Data, Queries, Rules, Event Specs, Subscriptions, Results]

In-Network XML Processing
• XML is becoming the wire format for data
• In-network XML processing
  – Authentication
  – Authorization
  – Routing
  – Transformation
  – Pattern matching
• Goals: expedite traffic, enhance security, real-time monitoring & diagnosis
• XPath widely used for in-network XML processing
• Applied directly to streaming XML data
• Line-speed performance

Research Issues
• Gigabit rate XPath processing
  – Take one look, process XPath, buffer data for future use if necessary
  – Processing needs to be gigabit rate
  – Memory usage needs to be minimized
• Time/space complexity of XPath stream processing
  – Theoretical analysis for common features of XPath
• Minimizing memory usage of YFilter technology
  – YFilter: state-of-the-art for multi-XPath processing

RFID Technology
• RFID technology
[Figure: RFID tags with IDs 01.01298.6EF.0A, 04.0768E.001.F0, 01.01267.60D.01; each reading is (reader_id, tag_id, timestamp)]

RFID Stream Processing
[Figure labels: RFID reader, RFID tag]
<pml>
  <tag>01.01298.6EF.0A</tag>
  <time>00129038</time>
  <location>shelf 2</location>
</pml>
<pml>
  <tag>01.01298.6EF.0A</tag>
  <time>02183947</time>
  <location>exit1</location>
</pml>
• Shoplifting: an item was taken out of store without being checked out.
• Out of stocks: the number of items of product X on shelf ≤ 3.

RFID Processing: Global Tracking
• Counterfeit drugs: a bottle is accepted at the retailer if it came from a legal manufacturer and followed all necessary steps in the distribution network.
• Expired/spoiled drugs: a bottle is accepted at the retailer if it went through the distribution network in less than 3 months and was never exposed to temperature > 96 F.
• Missing pallet, expected case, illegally cloned tags…
[Figure: six PML records for EPC 01.001298.6EF.0A along the distribution network]
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts type="begin"><date>…</date></ts>
  <entity type="maker"><name type="legal">X Ltd.</name></entity>
  …
</pml>
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts><date>…</date></ts>
  <location>…</location>
  <msr label="temperature" max="2">80</msr>
  …
</pml>
(three more in-transit records of the same form, with readings max="5" 95, max="2" 85, and max="2" 90)
<pml>
  <epc>01.001298.6EF.0A</epc>
  <ts type="end"><date>…</date></ts>
  <entity type="retailer"><name type="legal">CVS</name></entity>
  …
</pml>
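The two acceptance rules on the Global Tracking slide (legal origin and legal steps; transit in less than 3 months with no exposure above 96 F) can be sketched as a check over one bottle's trail of readings. This is a minimal sketch, not the course's method: the `Reading` fields are hypothetical simplifications of the PML records, "3 months" is approximated as 90 days, and "followed all necessary steps" is reduced to "starts at a legal maker, ends at a retailer, and every entity on the way is legal".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Reading:
    """One simplified tracking record (hypothetical fields, not real PML)."""
    epc: str
    time: datetime
    entity_type: str   # e.g. "maker", "distributor", "retailer"
    legal: bool        # whether the handling entity is a legal one
    max_temp_f: float  # highest temperature seen at this step, in Fahrenheit


MAX_TRANSIT = timedelta(days=90)  # assumption: "3 months" taken as 90 days
MAX_TEMP_F = 96.0                 # threshold quoted on the slide


def accept_bottle(readings):
    """Return True if the trail passes both the counterfeit and spoilage rules."""
    if not readings:
        return False
    trail = sorted(readings, key=lambda r: r.time)
    first, last = trail[0], trail[-1]

    # Counterfeit rule (simplified): legal maker first, retailer last,
    # and no illegal entity anywhere along the trail.
    legal_path = (
        first.entity_type == "maker"
        and last.entity_type == "retailer"
        and all(r.legal for r in trail)
    )

    # Expired/spoiled rule: short transit and never too hot.
    fast_enough = (last.time - first.time) < MAX_TRANSIT
    never_too_hot = all(r.max_temp_f <= MAX_TEMP_F for r in trail)

    return legal_path and fast_enough and never_too_hot


start = datetime(2017, 1, 1)
trail = [
    Reading("01.001298.6EF.0A", start, "maker", True, 80.0),
    Reading("01.001298.6EF.0A", start + timedelta(days=10), "distributor", True, 90.0),
    Reading("01.001298.6EF.0A", start + timedelta(days=30), "retailer", True, 85.0),
]
print(accept_bottle(trail))
```

A batch check like this needs the full trail at the retailer; the research issues in the deck concern doing the equivalent over live, incomplete, and distributed streams.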
Challenges in RFID Management
• Data-Information Mismatch
  – RFID raw data: (tag id, reader id, timestamp)
  – Meaningful information: shoplifting, misplaced inventory, out-of-stocks; expired drugs, spoiled drugs…
• Incomplete, inaccurate data
  – Readers miss tags
  – Readers can pick up tags from overlapping areas
• High-volume data
  – Readers read constantly, from all tags in range, without line-of-sight
  – Can create up to millions of terabytes of data in a single day
• Low-latency processing
  – Up-to-the-second information, time-critical actions

Research Issues
• Real-time event stream processing
  – Handling duplicate readings/results
  – Data cleaning
  – Data compression
• Handling incomplete readings
  – Inferences in event databases
  – Inferences over event streams
• Distributed processing
  – Real time anomaly detection
  – Distributed inferences

Adaptive Sensing of Atmosphere
• Environmental monitoring: real-time processing of huge-volume meteorological data
• Challenges
  – Large volume but limited bandwidth
  – Real-time processing
  – Uncertain data
  – Data archiving and querying the history
[Figure labels: Sense, Send (two sensors), Merge, Detection, Prediction]

Managing Uncertain Data
• Sources of data uncertainty
  1) Sensing noise and partial scanning
  2) Data compression
  3) Lossy wireless links
  4) Incomplete merging
• Managing uncertain data
  – Model sources of data uncertainty
  – Develop uncertainty calculus to combine the effects of these sources
  – Augment results with confidence values
[Figure: sources (1)-(3) marked on the two sensor paths, (4) on Merge; output: Tornado Detection, Prediction (confidence?)]
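To make "combine the effects of these sources" and "augment results with confidence values" concrete, here is a deliberately naive sketch. It assumes each of the four sources can be summarized by a single probability that the data survives that stage intact, and that the stages are independent, so the overall confidence is their product. Both assumptions, the function name, and every number below are illustrative only; the slides do not specify an uncertainty calculus.

```python
from math import prod


def combined_confidence(stage_confidences):
    """Multiply per-stage confidences, ASSUMING the stages are independent.

    stage_confidences maps a stage name to a probability in [0, 1].
    """
    for name, value in stage_confidences.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"confidence for {name!r} must be in [0, 1]")
    return prod(stage_confidences.values())


# Made-up per-stage values for the four sources named on the slide.
stages = {
    "sensing noise and partial scanning": 0.95,
    "data compression": 0.98,
    "lossy wireless links": 0.90,
    "incomplete merging": 0.99,
}

detection = {
    "event": "tornado",
    "confidence": combined_confidence(stages),
}
print(detection)
```

Independence is the weakest part of this sketch: noise, compression loss, and link loss are often correlated (for example, in bad weather), which is one reason a real uncertainty calculus is listed as an open problem.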
Managing Uncertain Data
• Sources of data uncertainty
  1) Sensing noise and partial scanning
  2) Data compression
  3) Lossy wireless links
  4) Incomplete merging
• Self diagnosis and tuning
  – Compare prediction at t with observation at t+1 (no ground truth?!)
  – System diagnosis when confidence value is low
  – Automatically tune the system
[Figure: sources (1)-(3) marked on the two sensor paths, (4) on Merge; output: Tornado Detection, Prediction (confidence?)]

Questions

Outline
• An outside look: DB Application
• An inside look: Anatomy of DBMS
• Project ideas: DB Application
• Project ideas: DBMS Internals

Application: UMass CS Pub DB
• UMass Computer Science Publication Database
  – All papers on professors’ web pages and in their DBLP records
  – All technical reports
• Search:
  – Catalog search (author, title, year, conference, etc.)
  – Text search (using SQL “LIKE”)
• Navigation
  – Overview of the structure of document collection
  – Area-based “drill down” and “roll up” with statistics
• Add document
• Top hits
• Example: http://dbpubs.stanford.edu:8090/aux/index-en.html
• Deliverables: useful software, user-friendly interface

Application: RFID Database
• RFID technology
• RFID supply chain
  – Locations: Manufacturer, Supplier DC, Retail DC, Retail Store
  – Objects: Truck, Pallet, Case

Application: RFID Database
• RFID technology
• RFID supply chain
• Database propagation
  – Streams of (reader_id, tag_id, time)
  – Semantics: reader_id → location, tag_id → object
  – Containment
    • Location-based, items in a case, cases on a pallet, pallets in a truck…
    • Duration of containment
  – History of movement: (object, location, time_in, time_out)
  – Data compression for duplicate readings
  – Integration with sensors: temperature, location…
• Track and trace queries

Data Quality
• Closed world assumption: not any more!
• Various sources of data loss
  1) Sensing noise
  2) Data compression
  3) Lossy wireless links
  4) Incomplete merging
• Probabilistic query processing
  – Model sources of data loss
  – Quantify the effect on queries: max(), avg(), percentile…
  – Output query results with confidence level
[Figure: sources (1)-(3) marked on the two sensor paths, (4) on Merge]

• Some idea on INFOD/data dissemination
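One simple way to "quantify the effect on queries" under data loss, shown here only as a sketch, is to compute worst-case bounds rather than probabilities. It assumes two things the slides do not state: the number of readings that should have arrived is known, and every reading lies in a known range [lo, hi]. Under those assumptions each missing reading could be anywhere in the range, which bounds avg() and max(). The coverage fraction is used as a crude stand-in for a confidence level; a probabilistic model of the loss sources would replace it.

```python
def avg_bounds(received, expected_count, lo, hi):
    """Worst-case bounds on the true average when some readings were lost.

    Assumes expected_count readings were produced and each lies in [lo, hi].
    """
    missing = expected_count - len(received)
    if expected_count <= 0 or missing < 0:
        raise ValueError("expected_count must be positive and >= len(received)")
    total = sum(received)
    low = (total + missing * lo) / expected_count
    high = (total + missing * hi) / expected_count
    return low, high


def max_bounds(received, expected_count, lo, hi):
    """Worst-case bounds on the true maximum under the same assumptions."""
    missing = expected_count - len(received)
    if expected_count <= 0 or missing < 0:
        raise ValueError("expected_count must be positive and >= len(received)")
    observed = max(received) if received else lo
    # A lost reading may have been as large as hi; with no loss the
    # observed maximum is exact.
    upper = hi if missing > 0 else observed
    return observed, upper


def coverage(received, expected_count):
    """Fraction of expected readings that arrived (a crude confidence proxy)."""
    return len(received) / expected_count


readings = [10.0, 20.0, 30.0]  # made-up values; one of four readings was lost
print(avg_bounds(readings, 4, 0.0, 40.0))
print(max_bounds(readings, 4, 0.0, 40.0))
print(coverage(readings, 4))
```

Bounds like these are sound but can be loose; percentile queries and tighter, probability-based confidence levels need an explicit model of how each source loses data.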