Stream Processing in Emerging Distributed Applications

Course Project Ideas
Yanlei Diao
University of Massachusetts Amherst
New Directions for DB Research
Sensor data: new architecture
XML: new data model
Streams: new execution model
Data quality and lineage: new services
…
Yanlei Diao, University of Massachusetts Amherst
5/22/2017
Querying in Sensor Networks
• Store data locally at sensors and push queries into the sensor network
– Flash memory energy efficiency.
– Limited capabilities of sensor platforms.
[Figure labels: Internet; gateway; push query to sensors; flash memory; acoustic stream; image stream]
Optimize for Flash and Limited RAM
• Flash Memory Constraints
– Data cannot be over-written, only erased
– Pages can often only be erased in blocks (16-64 KB)
– Unlike magnetic disks, cannot modify in-place
• Challenges:
– Energy: Organize data on flash to minimize read/write/erase operations
– Memory: Minimize use of memory for flash database.
[Figure: memory (~4-10 KB) and an erase block (~16-64 KB). 1. Load block into memory; 2. Modify in-memory; 3. Save block back]
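The load, modify, save cycle in the figure can be sketched in a few lines of Python. This is a toy model, not StonesDB code: the `FlashBlock` class, the `update_page` helper, and the tiny page and block sizes are all hypothetical, chosen only to show why a one-page update costs a whole-block erase.

```python
class FlashBlock:
    """Toy model of one flash erase block (sizes are illustrative only)."""

    def __init__(self, pages_per_block=4, page_size=4):
        self.page_size = page_size
        self.pages = [bytes([0xFF]) * page_size for _ in range(pages_per_block)]
        self.erase_count = 0
        self.write_count = 0

    def _is_erased(self, page):
        return all(b == 0xFF for b in page)

    def erase(self):
        # Erasing works on the whole block, never on a single page.
        self.pages = [bytes([0xFF]) * self.page_size for _ in self.pages]
        self.erase_count += 1

    def program(self, index, data):
        # A page can be written only while it is still in the erased state.
        if not self._is_erased(self.pages[index]):
            raise ValueError("page must be erased before it is written again")
        self.pages[index] = data
        self.write_count += 1


def update_page(block, index, data):
    """Read-modify-write: load block, modify in memory, erase, save back."""
    in_memory = list(block.pages)          # 1. load block into memory
    in_memory[index] = data                # 2. modify in memory
    block.erase()                          # whole-block erase
    for i, page in enumerate(in_memory):   # 3. save block back
        if any(b != 0xFF for b in page):
            block.program(i, page)
```

Updating one page of a block that holds two written pages costs one erase and two page writes in this model, which is the kind of overhead the energy challenge above asks a flash database to minimize.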
StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
Proxy Cache of Image Summaries
• Identify “best” sensors to forward query.
• Provide hints to reduce search complexity at sensor.
StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
[Figure labels: Query Engine; Partitioned Access Methods]
Research Issues in StonesDB
• Local Database Layer
– Reduce updates for indexing and aging.
– New cost models for self-tuning sensor databases.
– Energy-optimized query processing.
– Query processing over aged data.
• Distributed Database Layer
– What summaries are relevant to queries?
– What remainder queries to send to sensors?
– What resolution of summaries to cache?
XML (Extensible Markup Language)
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML: a tagging mechanism to describe content.
XML Data Model (Graph)
[Figure: a graph rooted at db (#0) with two book nodes (b1, b2) and a publisher node (mkp). Each book has pub, title, and author children whose pcdata leaves read "Complete...", "Chamberlin" and "Principles...", "Bernstein", "Newcomer"; the publisher has name and state children with pcdata "Morgan..." and "CA". The pub edges reference the publisher node.]
Main structure: ordered, labeled tree
References between nodes: becoming a graph
XQuery: XML Query Language
• A declarative language for querying XML data
• XPath: path expressions
– Patterns to be matched against an XML graph
– /bib/paper[author/lastname='Croft']/title
• FLWOR expressions
– Combining matching and restructuring of XML data
– For $p in distinct(document("bib.xml")//publisher)
  Let $b := document("bib.xml")/book[publisher = $p]
  Where count($b) > 100
  Order by $p/name
  Return $p
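To make the two query styles concrete, here is a minimal Python sketch that emulates the path expression and the For/Let/Where/Order by/Return pipeline with explicit loops over a parsed tree. The sample document, the function names, and the threshold passed in are assumptions made for illustration; publishers are plain text here rather than elements with a name child, so the ordering step sorts the text directly.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data, not taken from the slides.
BIB = """
<bib>
  <paper><title>T1</title><author><lastname>Croft</lastname></author></paper>
  <paper><title>T2</title><author><lastname>Diao</lastname></author></paper>
  <book><title>B1</title><publisher>P2</publisher></book>
  <book><title>B2</title><publisher>P1</publisher></book>
  <book><title>B3</title><publisher>P1</publisher></book>
</bib>
"""


def titles_by_lastname(root, lastname):
    """Emulates /bib/paper[author/lastname='...']/title with explicit loops."""
    return [
        paper.findtext("title")
        for paper in root.findall("paper")
        if any(n.text == lastname for n in paper.findall("author/lastname"))
    ]


def publishers_with_many_books(root, min_books):
    """Group books by publisher, keep large groups, return sorted names."""
    books = defaultdict(list)                  # Let: books per publisher
    for book in root.findall("book"):          # For: every distinct publisher
        books[book.findtext("publisher")].append(book)
    selected = [p for p, b in books.items() if len(b) > min_books]  # Where
    return sorted(selected)                    # Order by, then Return


root = ET.fromstring(BIB)
```

With this sample, `titles_by_lastname(root, "Croft")` gives `["T1"]` and `publishers_with_many_books(root, 1)` gives `["P1"]`. A real XQuery engine evaluates the declarative query directly; the loops only show what the clauses mean.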
Metadata Management using XML
• File systems for large-scale scientific simulations
– File systems: petabytes or even more
– Directory tree (metadata): large, can’t fit in memory
– Links between files: steps in a simulation, data derivation
• File Searches
– all the files generated on Oct 1, 2005
– all the files whose name is like ‘*simu*.txt’
– all the files that were generated from the file ‘basic-measures.txt’
→ Build an XML store to manage directory trees!
– XML data model
– XML Query language
– XML Indices
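The three file searches can be sketched against a small directory tree stored as XML. Everything in this Python example is an assumption for illustration: the `dir` and `file` element names, the `date` and `derived_from` attributes, the sample files, and the choice to follow derivation links transitively.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical directory tree; element and attribute names are made up.
TREE = """
<dir name="root">
  <file name="basic-measures.txt" date="2005-09-30"/>
  <dir name="run1">
    <file name="simu-a.txt" date="2005-10-01" derived_from="basic-measures.txt"/>
    <file name="plot.png" date="2005-10-01" derived_from="simu-a.txt"/>
  </dir>
</dir>
"""


def by_date(root, date):
    """All files generated on a given date."""
    return [f.get("name") for f in root.iter("file") if f.get("date") == date]


def by_name(root, pattern):
    """All files whose name matches a shell-style pattern."""
    return [f.get("name") for f in root.iter("file")
            if fnmatch.fnmatchcase(f.get("name"), pattern)]


def derived_from(root, source):
    """All files generated from `source`, following links transitively."""
    result, frontier = [], {source}
    while frontier:
        next_frontier = set()
        for f in root.iter("file"):
            name = f.get("name")
            if f.get("derived_from") in frontier and name not in result:
                result.append(name)
                next_frontier.add(name)
        frontier = next_frontier
    return result


root = ET.fromstring(TREE)
```

This version loads the whole tree into memory, which is exactly what the slide says is not possible at scale; an XML store with indices would answer the same searches without that assumption.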
XML Document Processing
• Multi-hierarchical XML markup of text documents
– Multi-hierarchies: part-of-speech, page-line
– Features in different hierarchies overlap in scope
– Need a query language & querying mechanism
– References [Nakov et al., 2005; Iacob & Dekhtyar, 2005]
• Querying and ranking of XML data
– XML fragments returned as results
– Fuzzy matches
– Ranking of matches
– References [Amer-Yahia et al., 2005; Luo et al., 2003]
• Well-defined problems → identify your contributions!
Data Stream Management
Traditional Database
• Data at rest
• One-shot or periodic queries
• Query-driven execution
Data Stream Processor
• Data in motion, unending
• Continuous, long-running queries
• Data-driven execution
[Figure: in a traditional database, a query runs over stored data (Attr1, Attr2, Attr3) and returns results; in a data stream processor, data flows past stored queries, rules, event specs, and subscriptions and produces results]
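The contrast can be illustrated with a continuous query written as a Python generator: the query is registered once and produces a new result every time a tuple arrives. The function name and the choice of a count-based sliding-window average are assumptions for this sketch, not part of the slides.

```python
from collections import deque


def continuous_avg(stream, window):
    """Yield the average of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)   # bounded state: old readings fall out
    for value in stream:            # execution is driven by arriving data
        recent.append(value)
        yield sum(recent) / len(recent)
```

For example, `list(continuous_avg([1, 2, 3, 4], 2))` yields `[1.0, 1.5, 2.5, 3.5]`. The input can be an unending iterator, since the generator never needs to see the end of the data.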
In-Network XML Processing
• XML is becoming the wire format for data
• In-network XML processing
– Authentication
– Authorization
– Routing
– Transformation
– Pattern matching
[Figure labels: expedite traffic; enhance security; real-time monitoring & diagnosis]
• XPath widely used for in-network XML processing
• Applied directly to streaming XML data
• Line-speed performance
Research Issues
• Gigabit rate XPath processing
– Take one look, process XPath, buffer data for future use if necessary
– Processing needs to be gigabit rate
– Memory usage needs to be minimized
• Time/space complexity of XPath stream processing
– Theoretical analysis for common features of XPath
• Minimizing memory usage of YFilter technology
– YFilter: state-of-the-art for multi-XPath processing
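The "take one look" requirement can be sketched with a single-pass matcher built on Python's incremental XML parser. This is not YFilter: it handles only one simple absolute path with no predicates or wildcards, and the `stream_match` name and sample path are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_match(xml_bytes, path):
    """Yield the text of elements matching a simple path like /bib/book/title."""
    target = [step for step in path.split("/") if step]
    stack = []                                   # tags from root to current node
    source = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == target:                  # matched at the closing tag
                yield (elem.text or "").strip()
            stack.pop()
            elem.clear()                         # discard content already seen
```

The only state kept between events is the stack of open tags, so memory grows with document depth rather than document size, apart from the emptied element shells the parser still links to their parents.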
RFID Technology
• RFID technology
[Figure: RFID tags with IDs 01.01298.6EF.0A, 04.0768E.001.F0, and 01.01267.60D.01; each reading is a (reader_id, tag_id, timestamp) tuple]
RFID Stream Processing
[Figure labels: RFID tag; RFID reader]
<pml>
<tag>01.01298.6EF.0A</tag>
<time>00129038</time>
<location>shelf 2</location>
</pml>
<pml>
<tag>01.01298.6EF.0A</tag>
<time>02183947</time>
<location>exit1</location>
</pml>
Shoplifting: an item was taken out of store without being checked out.
Out of stocks: the number of items of product X on shelf ≤ 3.
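The shoplifting rule can be sketched as a filter over a time-ordered stream of (tag, time, location) readings. The location name "checkout" is an assumption, since the slides show only "shelf 2" and "exit1", and times are plain integers here.

```python
def detect_shoplifting(events, checkout="checkout", exits=("exit1",)):
    """Yield (tag, time) for tags read at an exit with no earlier checkout."""
    checked_out = set()
    for tag, time, location in events:     # events must be in time order
        if location == checkout:
            checked_out.add(tag)
        elif location in exits and tag not in checked_out:
            yield tag, time
```

Feeding it the two readings above, `("01.01298.6EF.0A", 129038, "shelf 2")` followed by `("01.01298.6EF.0A", 2183947, "exit1")`, produces one alert for the exit reading, because no checkout reading came in between.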
RFID Processing: Global Tracking
Counterfeit drugs: a bottle is accepted at the retailer if it came from a legal manufacturer and followed all necessary steps in the distribution network.
Expired/spoiled drugs: a bottle is accepted at the retailer if it went through the distribution network in less than 3 months and was never exposed to temperature > 96 F.
Missing pallet, expected case, illegally cloned tags…
[Figure: a chain of <pml> records for <epc>01.001298.6EF.0A</epc>, one per stop in the distribution network, from an <entity type="maker"> with legal name X Ltd. to an <entity type="retailer"> with legal name CVS. Each record carries <ts> begin/end dates, a <location>, and an <msr label="temperature"> reading (80, 85, 90, 95).]
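The expired/spoiled rule can be sketched as a check over a bottle's tracking records. The record layout, a list of (begin date, end date, temperature in F) tuples per stop, is an assumption, as is reading "3 months" as 90 days; the `accept_bottle` name is hypothetical.

```python
from datetime import date


def accept_bottle(stops, max_days=90, max_temp_f=96):
    """Accept if the trip took under max_days and never exceeded max_temp_f.

    stops: list of (begin_date, end_date, temperature_f), one per location.
    """
    if not stops:
        return False                      # no tracking history, no acceptance
    first_seen = min(begin for begin, _, _ in stops)
    last_seen = max(end for _, end, _ in stops)
    elapsed_days = (last_seen - first_seen).days
    never_too_hot = all(temp <= max_temp_f for _, _, temp in stops)
    return elapsed_days < max_days and never_too_hot
```

A bottle tracked from `date(2005, 1, 1)` to `date(2005, 2, 1)` with readings of 80 and 95 is accepted; the same trip with a single reading of 97 is rejected.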
Challenges in RFID Management
• Data-Information Mismatch
– RFID raw data: (tag id, reader id, timestamp)
– Meaningful information: shoplifting, misplaced inventory, out-of-stocks; expired drugs, spoiled drugs…
• Incomplete, inaccurate data
– Readers miss tags
– Readers can pick up tags from overlapping areas
• High-volume data
– Readers read constantly, from all tags in range, without line-of-sight
– Can create up to millions of terabytes of data in a single day
• Low-latency processing
– Up-to-the-second information, time-critical actions
Research Issues
• Real-time event stream processing
– Handling duplicate readings/results
– Data cleaning
– Data compression
• Handling incomplete readings
– Inferences in event databases
– Inferences over event streams
• Distributed processing
– Real time anomaly detection
– Distributed inferences
Adaptive Sensing of Atmosphere
• Environmental monitoring: real-time processing of huge-volume meteorological data
• Challenges
– Large volume but limited bandwidth
– Real-time processing
– Uncertain data
– Data archiving and querying the history
[Figure: two sensing nodes each Sense and Send; their data passes through Merge, Detection, and Prediction]
Managing Uncertain Data
• Sources of data uncertainty
1) Sensing noise and partial scanning
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Managing uncertain data
– Model sources of data uncertainty
– Develop uncertainty calculus to combine the effects of these sources
– Augment results with confidence values
[Figure: sources (1)-(3) are marked on each of two sensing paths and source (4) on the Merge step, which feeds Tornado Detection and Prediction (confidence?)]
Managing Uncertain Data
• Sources of data uncertainty
1) Sensing noise and partial scanning
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Self diagnosis and tuning
– Compare prediction at t with observation at t+1 (no ground truth?!)
– System diagnosis when confidence value is low
– Automatically tune the system
[Figure: sources (1)-(3) are marked on each of two sensing paths and source (4) on the Merge step, which feeds Tornado Detection and Prediction (confidence?)]
Questions
Outline
• An outside look: DB Application
• An inside look: Anatomy of DBMS
• Project ideas: DB Application
• Project ideas: DBMS Internals
Application: UMass CS Pub DB
• UMass Computer Science Publication Database
– All papers on professors’ web pages and in their DBLP records
– All technical reports
• Search:
– Catalog search (author, title, year, conference, etc.)
– Text search (using SQL “LIKE”)
• Navigation
– Overview of the structure of document collection
– Area-based “drill down” and “roll up” with statistics
• Add document
• Top hits
• Example: http://dbpubs.stanford.edu:8090/aux/index-en.html
• Deliverables: useful software, user-friendly interface
Application: RFID Database
• RFID technology
• RFID supply chain
– Locations
– Objects
[Figure labels: truck, pallet, case; manufacturer, supplier DC, retail DC, retail store]
Application: RFID Database
• RFID technology
• RFID Supply chain
• Database propagation
– Streams of (reader_id, tag_id, time)
– Semantics: reader_id → location, tag_id → object
– Containment
• Location-based, items in a case, cases on a pallet, pallets in a truck…
• Duration of containment
– History of movement: (object, location, time_in, time_out)
– Data compression for duplicate readings
– Integration with sensors: temperature, location…
• Track and trace queries
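Turning raw readings into a movement history while compressing duplicate readings can be sketched as follows. The `compress_readings` name and the `reader_location` lookup table are assumptions for illustration, and tag IDs stand in for objects directly.

```python
def compress_readings(readings, reader_location):
    """Collapse time-ordered (reader_id, tag_id, time) readings into stays.

    Returns a list of (object, location, time_in, time_out) tuples.
    """
    open_stays = {}                        # tag_id -> [location, time_in, time_out]
    history = []
    for reader_id, tag_id, time in readings:
        location = reader_location[reader_id]      # reader_id -> location
        stay = open_stays.get(tag_id)
        if stay and stay[0] == location:
            stay[2] = time                 # duplicate reading: extend the stay
        else:
            if stay:
                history.append((tag_id, *stay))    # object moved: close the stay
            open_stays[tag_id] = [location, time, time]
    for tag_id, stay in open_stays.items():
        history.append((tag_id, *stay))    # flush stays still open at the end
    return history
```

Five readings of one tag, three at a shelf reader and two at an exit reader, compress to two history rows, one per location.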
Data Quality
• Closed world assumption: not any more!
• Various sources of data loss
1) Sensing noise
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Probabilistic query processing
– Model sources of data loss
– Quantify the effect on queries max(), avg(), percentile…
– Output query results with confidence level
[Figure: sources (1)-(3) are marked on each of two sensing paths and source (4) on the Merge step]
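As one possible reading of "output query results with confidence level", the Python sketch below reports avg() over the readings that actually arrived together with the half-width of a normal-approximation confidence interval. It assumes readings are lost at random, which is only one of many loss models a project could choose, and the `avg_with_confidence` name is hypothetical.

```python
import math
import statistics


def avg_with_confidence(received, z=1.96):
    """Return (mean, half_width) for the readings that were received.

    Assumes losses are random, so the received readings act as a sample;
    z=1.96 corresponds to roughly 95% under a normal approximation.
    """
    n = len(received)
    if n < 2:
        raise ValueError("need at least two readings to estimate spread")
    mean = statistics.mean(received)
    stderr = statistics.stdev(received) / math.sqrt(n)
    return mean, z * stderr
```

The interval narrows as more readings survive, which gives a simple way to quantify how data loss affects the answer; max() and percentile queries would need different estimators.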
• Some ideas on INFOD/data dissemination