XPEDIA: XML Processing for Data Integration by Amit
Manish Bhide, Manoj K Agarwal
IBM India Research Lab
India
{abmanish, manojkag}@in.ibm.com
Amir Bar-Or, Sriram Padmanabhan
IBM Software Group
USA
{baroram,srp}@us.ibm.com
Srinivas K. Mittapalli, Girish Venkatachaliah
IBM Software Group
India
{smittapa,girish}@in.ibm.com
XPEDIA - Introduction
XPEDIA stands for “XML Processing for Data
Integration”
XML documents have become popular
XPEDIA is designed to improve data integration for
XML documents
XPEDIA uses parallelization and an ELT flow
ETL In Databases
Extract, transform, and load (ETL):
Extracting data from outside sources
Transforming data to fit operational needs
Loading it into the end target (database or data
warehouse)
Typical ETL Scenario With XML
Zoom-In Flow-1
The Read_XML_Table operator simply reads
the XML documents
XML Hierarchy Tree
The Equi-Hierarchical Join
operator
The operator iterates over every
“Country” subtree in the XML
The operator finds the set of
employees working in each
department in that country
The operator creates a new
element named “Dept2”,
which contains a list of all
employees working in that
department
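The join described above can be sketched in a few lines. This is a minimal illustration, not XPEDIA's implementation: the document layout (a `Dept` element with a `name` attribute, an `Employee` element with `name`, `dept` and `salary` attributes) and the function name `equi_hierarchical_join` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input layout; the slides do not show the real schema.
DOC = """
<Company>
  <Country name="India">
    <Dept name="Research"/>
    <Dept name="Sales"/>
    <Employee name="A" dept="Research" salary="100"/>
    <Employee name="B" dept="Sales" salary="80"/>
    <Employee name="C" dept="Research" salary="120"/>
  </Country>
  <Country name="USA">
    <Dept name="Sales"/>
    <Employee name="D" dept="Sales" salary="90"/>
  </Country>
</Company>
"""

def equi_hierarchical_join(root):
    # Each Country subtree is one scope instance: the join never crosses countries.
    for country in root.findall("Country"):
        employees = country.findall("Employee")
        for dept in country.findall("Dept"):
            # New element holding every employee of this department.
            dept2 = ET.SubElement(country, "Dept2", name=dept.get("name"))
            for emp in employees:
                if emp.get("dept") == dept.get("name"):  # equality predicate
                    ET.SubElement(dept2, "Employee", dict(emp.attrib))
    return root

root = equi_hierarchical_join(ET.fromstring(DOC))
india = root.find("Country[@name='India']")
joined = {d.get("name"): [e.get("name") for e in d.findall("Employee")]
          for d in india.findall("Dept2")}
print(joined)
```

Because the match is evaluated per `Country`, the employee named D in the second country is never paired with a department of the first one.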
The Aggregation operator
The operator calculates the total salary
of all the employees in a
department
The operator adds the result to the
XML document as “totalSalary”.
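A minimal sketch of this aggregation step, assuming the joined document already contains `Dept2` elements whose `Employee` children carry a `salary` attribute (both the layout and the name `add_total_salary` are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the join step.
DOC = """
<Company>
  <Country name="India">
    <Dept2 name="Research">
      <Employee name="A" salary="100"/>
      <Employee name="C" salary="120"/>
    </Dept2>
    <Dept2 name="Sales">
      <Employee name="B" salary="80"/>
    </Dept2>
  </Country>
</Company>
"""

def add_total_salary(root):
    # Snapshot the departments first, since the tree is modified in the loop.
    for dept in list(root.iter("Dept2")):
        # The sum starts from zero again for every department.
        total = sum(int(e.get("salary")) for e in dept.findall("Employee"))
        ET.SubElement(dept, "totalSalary").text = str(total)
    return root

root = add_total_salary(ET.fromstring(DOC))
totals = {d.get("name"): d.findtext("totalSalary") for d in root.iter("Dept2")}
print(totals)
```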
The Shredder Operator
The operator writes the totalSalary in the modified XML document to the
relational database.
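As an illustration of shredding, the sketch below writes each `totalSalary` value to a relational table. It uses an in-memory SQLite database as a stand-in target; the table name `dept_totals`, its columns, and the document layout are assumptions, not part of XPEDIA.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document produced by the aggregation step.
DOC = """
<Company>
  <Country name="India">
    <Dept2 name="Research"><totalSalary>220</totalSalary></Dept2>
    <Dept2 name="Sales"><totalSalary>80</totalSalary></Dept2>
  </Country>
</Company>
"""

def shred_totals(root, conn):
    """Flatten (country, department, totalSalary) into relational rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dept_totals "
        "(country TEXT, dept TEXT, total_salary INTEGER)"
    )
    rows = [
        (country.get("name"), dept.get("name"), int(dept.findtext("totalSalary")))
        for country in root.findall("Country")
        for dept in country.findall("Dept2")
    ]
    conn.executemany("INSERT INTO dept_totals VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
written = shred_totals(ET.fromstring(DOC), conn)
stored = conn.execute(
    "SELECT country, dept, total_salary FROM dept_totals ORDER BY dept"
).fetchall()
print(written, stored)
```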
Problem
Today, databases support a limited representation of XML
documents
Processing an XML document requires full extraction
and parsing of the document
XML documents grow larger with time
A need for complex transformations has arisen
Problem – Computational Model
Relational data is represented in the form of rows and
columns
In this model, each XML document is represented as a
single row and a single column.
There is a need for a technique that handles complex
data flows while preserving the simple specification
Problem – Scalability
In relational data, the size of a row/tuple is
seldom larger than a few KB
XML documents, which are composed
of many small objects, often exceed 1 GB
The Solution – XPEDIA
Computational Model
ELT Support
Scalability – parallelism
XPEDIA Computational Model
XPEDIA uses a dataflow model consisting of operators
and edges
The key difference in XPEDIA model:
The data that flows between operators is an ordered list
of XML documents that comply with a single XML
schema
Example
<High_Value_Customers>
  <City name="Haifa"/>
  <Customer name="Amit"/>
  ...
  <Customer name="Shay"/>
</High_Value_Customers>
List: <Customer_Vector, City_Vector>
Customer_Vector: <Amit, …, Shay>
City_Vector: <Haifa>
XPEDIA Computational Model (cont.)
Operators can iterate over
a sub-vector of a document object
The iterated vector is defined
as “scope” vector of the operator
XML Operators
Filter operator:
Filters one of the vectors within a scope
Project operator:
Iterates over a single vector and generates a new output vector
that is based on a set of select expressions
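The two operators can be sketched over a vector of sibling elements. This is an illustrative sketch only: the `balance` attribute, the customer named Dana, the `Label` output vector, and the helper names `filter_vector` and `project_vector` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# One scope instance holding a Customer vector (attributes are made up).
DOC = """
<High_Value_Customers>
  <City name="Haifa"/>
  <Customer name="Amit" balance="500"/>
  <Customer name="Dana" balance="50"/>
  <Customer name="Shay" balance="900"/>
</High_Value_Customers>
"""

def filter_vector(scope, tag, predicate):
    """Drop the items of one vector (children named `tag`) that fail the predicate."""
    for item in scope.findall(tag):
        if not predicate(item):
            scope.remove(item)

def project_vector(scope, tag, out_tag, select):
    """Append a new vector `out_tag`; each item holds the evaluated select expressions."""
    for item in scope.findall(tag):
        attrs = {name: str(expr(item)) for name, expr in select.items()}
        ET.SubElement(scope, out_tag, attrs)

scope = ET.fromstring(DOC)
filter_vector(scope, "Customer", lambda c: int(c.get("balance")) >= 100)
project_vector(scope, "Customer", "Label", {"text": lambda c: c.get("name").upper()})

kept = [c.get("name") for c in scope.findall("Customer")]
labels = [l.get("text") for l in scope.findall("Label")]
print(kept, labels)
```

Only the `Customer` vector is touched; the `City` vector in the same scope is left as it was.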
XML Operators – Aggregate Operator
Produces statistics by aggregating one of the vectors.
The aggregation restarts for each scope item
XML Operators – Equi-Hierarchical Join
Performs an equality based join between two vectors
that are contained within the scope instance
XML Operators – Read/Write Table
Read Table Operator
Reads all the rows of a single table and outputs a
relational tuple or XML document
Write Table Operator
Used for writing relational or XML data into a table
XML Operators – Output Stage Operator
Input column -> output path:
Department -> /Company/Country/Dept
Project -> /Company/Country/Emplyee/PName
Emp ID -> /Company/Country/Emplyee/Einfo/EmpID
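To show what such a column-to-path mapping implies, here is a simplified sketch that turns relational rows into a hierarchical document. It assumes every row produces its own subtree below the root and does no grouping across rows, which the real operator may well do; the function name `output_stage` and the sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Column -> target path, copied from the mapping above (spelling included).
MAPPING = {
    "Department": "/Company/Country/Dept",
    "Project": "/Company/Country/Emplyee/PName",
    "Emp ID": "/Company/Country/Emplyee/Einfo/EmpID",
}

def output_stage(rows, mapping):
    """Build one document; the columns of a row share their common parent nodes."""
    root_tag = next(iter(mapping.values())).strip("/").split("/")[0]
    root = ET.Element(root_tag)
    for row in rows:
        created = {}  # path prefix -> element created for this row
        for column, path in mapping.items():
            steps = path.strip("/").split("/")
            node = root
            for depth in range(1, len(steps)):
                prefix = tuple(steps[: depth + 1])
                if prefix not in created:
                    created[prefix] = ET.SubElement(node, steps[depth])
                node = created[prefix]
            node.text = str(row[column])  # the leaf receives the column value
    return root

doc = output_stage(
    [
        {"Department": "Research", "Project": "XPEDIA", "Emp ID": 7},
        {"Department": "Sales", "Project": "CRM", "Emp ID": 9},
    ],
    MAPPING,
)
print(ET.tostring(doc, encoding="unicode"))
```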
ELT
ELT (Extract, Load, Transform)
Takes parts of the ETL job flow and converts them into
SQL/XML queries
ELT is a technique to gain efficiency and performance
by shifting a significant share of the processing into the database
ELT In XPEDIA
Databases such as DB2 9, Oracle 11g and SQL Server
2005 have inbuilt XQuery and SQL/XML query
engines.
XPEDIA applies rewriting techniques to transform
parts of the ETL job flow into SQL/XML
How XPEDIA converts ETL to ELT
The following tasks are required for converting
ETL to ELT:
1. Rewrite the ETL flow in terms of simpler operators.
2. Convert each operator into a SQL/XML query.
3. Merge the SQL/XML queries of adjacent operators
into a single SQL/XML query.
4. Convert the merged SQL/XML queries to an ELT job
definition which can be executed on XPEDIA.
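Steps 2 and 3 can be illustrated with a toy example. The sketch below uses plain SQL on SQLite as a stand-in for SQL/XML, and it merges adjacent operators simply by nesting each operator's query around the previous one; XPEDIA's actual conversion algorithms and merge rules are not given in these slides. The helpers `filter_sql`, `aggregate_sql` and `merge`, and the `emp` table, are hypothetical.

```python
import sqlite3

def filter_sql(source, condition):
    """Query fragment for a Filter operator applied to `source`."""
    return f"SELECT * FROM ({source}) WHERE {condition}"

def aggregate_sql(source, key, column):
    """Query fragment for an Aggregate operator applied to `source`."""
    return f"SELECT {key}, SUM({column}) AS total FROM ({source}) GROUP BY {key}"

def merge(table, operators):
    """Fold a chain of adjacent operators into one statement."""
    query = f"SELECT * FROM {table}"
    for build in operators:
        query = build(query)
    return query

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("Research", 100), ("Research", 120), ("Sales", 80), ("Sales", 5)],
)

query = merge("emp", [
    lambda q: filter_sql(q, "salary >= 50"),
    lambda q: aggregate_sql(q, "dept", "salary"),
])
result = conn.execute(query + " ORDER BY dept").fetchall()
print(query)
print(result)
```

The point of the merge is that the database receives a single statement for the whole chain, so intermediate results never leave the engine.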
Simplify The ETL Flow
Most of the operators in XPEDIA can be directly
converted to a SQL/XML query
Complex operators, like the OutputStage, are difficult
to translate to SQL/XML queries directly
We need to rewrite complex
operators in terms of simpler operators
Example
The algorithm for converting the OutputStage operator into
a set of simpler operators:
Step 1: Apply XMLize operator on the relational data to
obtain flat XML document
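A flat XML document in this sense can be sketched as one element per row with one child per column. The tag names `rows` and `row`, the function name `xmlize`, and the replacement of spaces in column names are assumptions for the example, not the behavior of XPEDIA's XMLize operator.

```python
import xml.etree.ElementTree as ET

def xmlize(rows, root_tag="rows", row_tag="row"):
    """Turn relational rows (dicts) into a flat, two-level XML document."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Column names become element names; spaces are not legal in XML names.
            ET.SubElement(item, column.replace(" ", "_")).text = str(value)
    return root

flat = xmlize([{"Department": "Research", "Project": "XPEDIA", "Emp ID": 7}])
xml_text = ET.tostring(flat, encoding="unicode")
print(xml_text)
```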
Example (cont.)
Step 2:
Example (cont.)
Step 3: Use the Project operator to add and drop nodes, so as to bring
every output node to the correct height.
Step 4: Use Project Operator to change names of nodes
Query Generation and Merging
The XPEDIA ELT optimizer has a set of algorithms for
converting operators to SQL/XML query.
The XPEDIA ELT optimizer uses a set of rules for
merging these SQL/XML queries.
Generating The ELT Job Definition
The generated SQL/XML queries are mapped to the
XPEDIA job definition
XPEDIA translates the job definition to a Read Table
operator and the rest of the ETL flow remains the same
The Result
We can now use a single SQL/XML query to replace
the operators between the XML data source and the RDBMS
ELT allows us to use only Read/Write Table operators
Benefit: a reduction in the size of the data that needs to
be moved
XPEDIA ELT Conclusion
XPEDIA is able to use the native XML processing
capabilities of the database engine to greatly improve
performance.
If the database does not have native XML support, or the data is
present in a flat file, XPEDIA cannot use the ELT
optimizer
Parallel Processing of XML Data
XPEDIA supports 2 types of job parallelism:
Pipeline: each operator is handled by a different
resource
Partitioning: the XML document is divided into several
partitions, each processed separately
Pipelining Limitations
Pipelining limits scalability: it can only use as many
resources as there are operators
In pipelining, each resource will need to work on the
entire data
By using partitioning, we allow better usage of
available resources on large documents
Partitions Generation
XPEDIA identifies what nodes are optimal for
partitioning
The chosen partition is then divided between
resources using one of the following methods:
Round Robin
Chunking Scheme
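The slides do not define the two schemes further. Under the common reading (round robin deals items out one at a time, chunking hands each resource one contiguous run), they can be sketched as follows; the function names and the `Country0`..`Country6` items are hypothetical.

```python
def round_robin(items, workers):
    """Item i goes to worker i mod workers."""
    parts = [[] for _ in range(workers)]
    for i, item in enumerate(items):
        parts[i % workers].append(item)
    return parts

def chunking(items, workers):
    """Each worker gets one contiguous run; run sizes differ by at most one."""
    base, extra = divmod(len(items), workers)
    parts, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts

# Seven partition nodes spread over three resources.
nodes = [f"Country{i}" for i in range(7)]
rr = round_robin(nodes, 3)
ch = chunking(nodes, 3)
print(rr)
print(ch)
```

Round robin interleaves neighbouring nodes across resources, while chunking keeps document order within each resource.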
Shallow Parsing
Dividing the work requires some parsing
The parsing that is done is only partial, from root node
to partition node
Since the shallow parsing overhead is different for every
partition, load balancing is sometimes done when
choosing chunk sizes
What Have We Gained With XPEDIA
A performance gain of up to 70% from using the XPEDIA ETL
tools so that more processing is done inside the
database engine.
Using XPEDIA to partition the ETL job across multiple nodes is
scalable and can improve the processing speed of the ETL job
by up to 2.9 times on a 4-processor configuration
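To put the 2.9x figure in context, the short calculation below derives the parallel efficiency on 4 processors and, assuming Amdahl's law applies (an assumption of this sketch; the slides report no serial fraction), the share of the job that would have to run in parallel to reach that speedup.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup / processors

def amdahl_parallel_fraction(speedup, processors):
    """Solve S = 1 / ((1 - p) + p / N) for the parallel fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / processors)

eff = parallel_efficiency(2.9, 4)
p = amdahl_parallel_fraction(2.9, 4)
print(f"efficiency = {eff:.1%}, implied parallel fraction = {p:.1%}")
```

On these assumptions, a 2.9x speedup on 4 processors corresponds to roughly 72.5% efficiency and a parallel fraction of about 87%.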
Summary
We saw how XPEDIA deals with the new problems
that have arisen
Parallel processing techniques are used for handling
large XML documents
XPEDIA ELT system is able to take advantage of the
native XML processing capabilities of the database
engine and greatly improve performance.
Questions ?