BIOMEDICAL DATA INTEGRATION
BASED ON
METAQUERIER ARCHITECTURE
ADVISOR : KHONDKER SHAJADUL HASAN
CO-ADVISOR : JAVED SIDDIQUE
GROUP MEMBERS
-NAIEEM KHAN
-EUSUF ABDULLAH MIM
-M SAMIULLAH CHOWDHURY
Three basic parts of the project
• DATA INTEGRATION
• METAQUERIER ARCHITECTURE
• BIOMEDICAL DATA
DATA INTEGRATION
What does it mean?
Data integration is the process of combining
data residing at different sources and
providing the user with a unified view of these
data. The need arises in a variety of
situations, both commercial (when two similar
companies need to merge their databases)
and scientific (combining research results from
different bioinformatics repositories).
Data integration appears with increasing
frequency as the volume of existing data and
the need to share it explode.
DATA INTEGRATION
Simple schematic for a data warehouse: the information from
the source databases is extracted, transformed, then loaded into
the data warehouse.
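The extract, transform, load flow in the schematic can be sketched as follows. This is a minimal illustration, not the project's implementation: the two sources, their row layouts, the `protein` table and all values are made-up assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical sources with different layouts (toy values, not real data).
source_a = [{"id": 1, "name": " Protein A ", "mass_kda": "12.5"}]
source_b = [("X001", "protein b", 30)]


def extract():
    """Yield raw rows from every source, tagged with the source name."""
    for row in source_a:
        yield "a", row
    for row in source_b:
        yield "b", row


def transform(tagged_row):
    """Map a raw row onto the warehouse layout (source, name, mass_kda)."""
    source, row = tagged_row
    if source == "a":
        return (source, row["name"].strip().lower(), float(row["mass_kda"]))
    _identifier, name, mass = row
    return (source, name.strip().lower(), float(mass))


def load(rows):
    """Load transformed rows into an in-memory SQLite 'warehouse'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (source TEXT, name TEXT, mass_kda REAL)")
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = load(transform(r) for r in extract())
```

After loading, a single query such as `SELECT name, mass_kda FROM protein` sees rows from both sources in one layout, which is the unified view the slide describes.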
DATA INTEGRATION
Difficulties of Data Integration
Web databases are huge
Database content is now dynamic
An efficient data crawler is necessary
Query interfaces must be accurate and complete
Time efficiency
Depth
Volume of data to handle
DATA INTEGRATION
Importance
Integration from web databases.
Data integration is essential for gathering the necessary information from
different sources.
It makes large-scale integration possible.
It gives efficient and accurate query answers.
Consider a user who is moving to a new town. To start with, different queries
need different sources to answer: Where can she look for real estate
listings? (e.g., realtor.com.) Shopping for a new car? (cars.com.) Looking
for a job? (monster.com.) Further, different sources support different query
capabilities: after source hunting, the user must then learn the gruelling
details of querying each source.
METAQUERIER ARCHITECTURE
There are different approaches and paradigms for data integration. Some of
them are:
• Materialized: a physical, integrated repository is created.
• Data Warehouses: physical repositories of selected data extracted from a
collection of DBs and other information sources.
• Mediated: data stay at the sources; a virtual integration system is created.
• Federated and cooperative: DBMSs are coordinated to collaborate.
• Exchange: data is exported from one system to another.
• Peer-to-Peer data exchange: many peers exchange data without a central
control mechanism. Data is passed from peer to peer upon request,
as query answers.
METAQUERIER ARCHITECTURE
Two basic goals of MetaQuerier
• First, to make the deep Web systematically accessible, it
will help users find online databases useful for their queries.
• Second, to make the deep Web uniformly usable, it will help users
query online databases.
METAQUERIER ARCHITECTURE
The MetaQuerier architecture rests on two basic principles
Dynamic Discovery
As sources are changing, they must
be dynamically discovered for integration.
There are no preselected sources.
On-the-Fly Integration
As queries are ad hoc, MetaQuerier
must mediate them on the fly for the relevant
sources. There are no preconfigured sources.
METAQUERIER ARCHITECTURE
PARTS OF METAQUERIER SYSTEM
Front end: Result Compilation, Query Translation, Source Selection
Deep Web Repository
Back end: Database Crawler, Interface Extraction, Source Clustering, Schema Matching
METAQUERIER ARCHITECTURE
PROCESSES OF THE METAQUERIER SYSTEM ARCHITECTURE
DATA CRAWLER
INTERFACE EXTRACTION
SOURCE CLUSTERING
SCHEMA MATCHING
RESULT COMPILATION
QUERY TRANSLATION
SOURCE SELECTION
PROCESSES OF THE METAQUERIER ARCHITECTURE
Data Crawler
A process that gathers specific information from web databases
and other resources.
Similar to web crawling, the process search engines use to
gather pages from the Internet.
The Data Crawler filters and categorizes the data it finds to make
the system efficient.
It has two segments:
Site Crawler
Shallow Crawler
PROCESSES OF THE METAQUERIER ARCHITECTURE
Data Crawler
Workflow:
• The Site Crawler needs an efficient query interface.
• It takes user-queryable keywords from the interface and filters the
query.
• The Site Crawler goes through the root page.
• It identifies IP addresses.
• The Shallow Crawler follows the IP addresses found.
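The two crawler stages above can be sketched roughly as follows. This is only an illustration under stated assumptions: the "web" is a made-up in-memory dictionary, the addresses and pages are invented, and the depth limit `max_depth` is an assumption about what "shallow" means rather than something the slides specify. Keyword filtering is not modelled.

```python
from collections import deque

# Toy stand-in for the Web: page -> (links on the page, has a query form?).
# All addresses and pages are made up for illustration.
TOY_WEB = {
    "10.0.0.1/": (["10.0.0.1/search", "10.0.0.1/about"], False),
    "10.0.0.1/search": ([], True),
    "10.0.0.1/about": (["10.0.0.1/deep"], False),
    "10.0.0.1/deep": (["10.0.0.1/deeper"], False),
    "10.0.0.1/deeper": ([], True),
    "10.0.0.2/": ([], False),
}


def site_crawler(web):
    """Return the distinct server addresses, read off the root pages."""
    hosts = set()
    for page in web:
        host, path = page.split("/", 1)
        if path == "":  # a root page such as "10.0.0.1/"
            hosts.add(host)
    return sorted(hosts)


def shallow_crawler(web, host, max_depth=2):
    """Breadth-first crawl from the root page; links are followed only
    from pages at depth < max_depth."""
    root = host + "/"
    found, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        links, has_form = web.get(page, ([], False))
        if has_form:
            found.append(page)  # this page exposes a query interface
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return found


interfaces = {h: shallow_crawler(TOY_WEB, h) for h in site_crawler(TOY_WEB)}
```

With the default limit the crawler finds the form near the root and skips the one buried three links down, which is the trade-off a shallow crawl makes between coverage and cost.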
PROCESSES OF THE METAQUERIER ARCHITECTURE
Data Crawler
Advantages
Two major challenges are addressed through data crawling:
Dynamic discovery
Deep web searching
Dynamic discovery is covered by the Site Crawler.
Deep web searching is covered by the Shallow Crawler.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Interface Extraction
Interface Extraction extracts the required data from the query
interfaces.
Query interfaces share similar query patterns, but the patterns sometimes differ.
The differences arise from hidden information or attributes.
These attributes are not visible on the interface.
Workflow:
The Data Crawler hands over a huge amount of unsorted and hidden data.
IE generates a query which extracts the found data.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Interface Extraction
Key Features
It takes query interfaces in HTML format.
Then it functions as a visual language parser.
Interface Extraction tokenizes the page, parses the tokens and then
merges potentially multiple trees.
Finally it generates the query capabilities.
The basic idea of interface extraction is to extract query capabilities
from query interfaces.
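As a rough illustration of extracting query capabilities from an HTML query interface, the sketch below reads the form elements with Python's standard `html.parser`. It is a much simpler stand-in for the visual language parser described above: it looks only at tags, not at layout, and the sample form and the shape of the `capabilities` dictionary are assumptions made for this example.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collect, per element name, its element type and allowed values."""

    def __init__(self):
        super().__init__()
        self.capabilities = {}
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and "name" in attrs:
            kind = attrs.get("type", "text")
            entry = self.capabilities.setdefault(
                attrs["name"], {"type": kind, "values": []}
            )
            # Radio buttons and checkboxes enumerate their allowed values.
            if kind in ("radio", "checkbox") and "value" in attrs:
                entry["values"].append(attrs["value"])
        elif tag == "select" and "name" in attrs:
            self._select = attrs["name"]
            self.capabilities[self._select] = {"type": "select", "values": []}
        elif tag == "option" and self._select is not None:
            self.capabilities[self._select]["values"].append(attrs.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


# Hypothetical query interface for a protein database.
SAMPLE_FORM = """
<form action="/search">
  <input type="text" name="name">
  <select name="category">
    <option value="enzyme">Enzyme</option>
    <option value="membrane">Membrane Protein</option>
  </select>
  <input type="radio" name="reviewed" value="yes" checked>
  <input type="radio" name="reviewed" value="no">
</form>
"""

extractor = FormExtractor()
extractor.feed(SAMPLE_FORM)
capabilities = extractor.capabilities
```

Each entry in `capabilities` says which attribute can be queried and which values the interface accepts; a free-text box has an empty value list because any string may be entered.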
PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Selection
Having defined a common mediated schema for all data sources, we need
to match and map the data sources according to the mediated
schema.
The target user may understand the concepts in their own domain
but may not know those of other domains.
The solution is to decide which sources to include in data integration and
which mediated schema to use.
All ontologies are stored in a common repository.
The system identifies which ontology to use based on the
user-submitted query.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Result Compilation
The last process of data integration.
It aggregates query results for the user.
It compiles results from different sources into coherent pieces.
It draws on the schema matching data to
match attributes across different sources.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering
• Collaborates with source selection, which works in the
front end.
• Clusters sources according to subject domain (e.g., edu,
org).
• Sorts data as an intermediate step that feeds
schema matching.
• Its main task is to construct a hierarchy of clusters from a
given set of query capabilities.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
CHARACTERISTICS OF DOMAIN ELEMENTS AND CONSTRAINT ELEMENTS:
• Textboxes cannot be used for constraint elements.
• Radio buttons or checkboxes or selection lists may appear as
constraint elements.
• An attribute that consists of a single element cannot have constraint
elements.
• An attribute consisting of only radio buttons or checkboxes does
not have constraint elements.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
HOW TO DIFFERENTIATE BETWEEN DOMAIN & CONSTRAINT ELEMENTS:
A simple two-step method can be used:
1. First, identify the attributes that contain only one element or
whose elements are all radio buttons, checkboxes or
textboxes.
2. Second, an Element Classifier is needed to process the other
attributes, which may contain both domain elements and
constraint elements. Each element is represented by four
features: element name, element format type, element relative
position in the element list, and element values.
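The two steps can be sketched as below. Step 1 is implemented as stated; for step 2 the sketch only builds the four-part feature for each element and leaves the Element Classifier itself out, since the slides do not say how it decides. The `Element` class, the attribute names and the sample interface are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    format_type: str  # "textbox", "radio", "checkbox" or "select"
    values: tuple = ()


SIMPLE_TYPES = {"radio", "checkbox", "textbox"}


def settled_by_step_one(elements):
    """Step 1: a single element, or only radio/checkbox/textbox elements,
    means the attribute needs no element classifier."""
    return len(elements) == 1 or all(
        e.format_type in SIMPLE_TYPES for e in elements
    )


def features(elements):
    """Step 2 input: (name, format type, relative position, values)."""
    return [(e.name, e.format_type, i, e.values) for i, e in enumerate(elements)]


def split_attributes(attributes):
    """Return (attributes settled by step 1, features queued for step 2)."""
    settled, pending = [], {}
    for label, elements in attributes.items():
        if settled_by_step_one(elements):
            settled.append(label)
        else:
            pending[label] = features(elements)
    return settled, pending


# Hypothetical interface: 'title' pairs a textbox with an operator list.
attributes = {
    "name": [Element("name", "textbox")],
    "production year": [Element("year_from", "textbox"), Element("year_to", "textbox")],
    "title": [
        Element("title", "textbox"),
        Element("title_op", "select", ("all words", "any words")),
    ],
}
settled, pending = split_attributes(attributes)
```

Only `title` reaches step 2 here, because it mixes a textbox with a selection list that may well be a constraint element.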
PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
DERIVING INFORMATION FROM ATTRIBUTES:
Four types of information are defined for each attribute (only for
domain elements):
1. Domain type: Indicates how many distinct values can be
used for an attribute in queries. Four domain
types are defined in our model:
• range
• finite
• infinite
• Boolean
2. Value type: Each attribute on a search interface has its own
semantic value type.
• All input values are treated as text values
PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
3. Default Value:
• Indicates some semantics of the attribute.
• May occur in a selection list, a group of radio buttons or a group
of checkboxes.
• Always marked as “checked” or “selected”.
4. Unit:
• Defines the meaning of an attribute value,
e.g., kilogram is a unit for weight.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Schema Matching
• A schema defines the tables, the fields in each table, and the relationships
between fields and tables.
• It is often depicted as a graphical representation of the database structure.
• Schema matching is the process of identifying whether two objects
are semantically related, while mapping refers to the
transformations between the objects.
• In data integration, schema matching finds the semantic correspondences
among the attributes that have been found through
query interfaces.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Schema Matching
• Uses data from query capabilities and organizes the data as
required.
• It provides data to Source Selection and Query Translation, and the data
finally reaches users at the front end.
• MetaQuerier recasts the process as complex matching
instead of one-by-one matching.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Query Translation
• A front-end process.
• Translation is necessary to match and express query conditions in
terms of what an interface supports.
• It is critical to interpret queries automatically.
Steps for complete query translation:
Step 1: Extract constraint templates from a query interface.
Step 2: Find matching templates among the given source and target constraint
templates.
PROCESSES OF THE METAQUERIER ARCHITECTURE
Query Translation
Constraint mapping:
• The objective is to find the target constraint with the closest semantic meaning
to the source constraint.
Query mediation:
• Mediates queries across multiple sources.
• Abstracts the problem as an instance of answering queries using views.
• The focus is to decompose a user query into sub-queries across multiple sources.
Schema mapping:
• Translates a set of data values from one source to another, according to
a given matching.
• Concerns only the equality relation between different schemas.
BIOMEDICAL DATA - PROTEIN
WHAT IS PROTEIN:
Any of a large group of nitrogenous organic compounds that
are essential constituents of living cells; consist of polymers of
amino acids; essential in the diet of animals for growth and for
repair of tissues; can be obtained from meat and eggs and milk
and legumes.
TYPE OF
MACROMOLECULE
SUPERMOLECULE
PART OF
AMINO ACID
AMINOALKANOIC ACID
POLYPEPTIDE
BIOMEDICAL DATA - PROTEIN
SOME EXAMPLES OF PROTEIN INFORMATION
BIOMEDICAL DATA - PROTEIN
AVAILABLE WEB SERVICES ABOUT PROTEIN
Source Clustering (Example)
DERIVING INFORMATION FROM ATTRIBUTES:
1. Domain type: range, finite, infinite and Boolean
• Here, two textboxes are used to represent a range for the attribute
Production Year, thus the attribute should have the range domain type.
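A rule-of-thumb version of this inference might look like the sketch below. Only the "two textboxes give a range" rule comes from the example above; the rules for Boolean, finite and infinite are assumptions added to make the sketch complete, and the `(format_type, values)` input shape is hypothetical.

```python
def domain_type(elements):
    """Guess the domain type from a list of (format_type, values) pairs."""
    kinds = [kind for kind, _values in elements]
    if kinds == ["textbox", "textbox"]:
        return "range"    # two textboxes bound a range, as for Production Year
    if kinds == ["checkbox"]:
        return "Boolean"  # assumption: a lone checkbox is an on/off choice
    if all(kind in ("select", "radio", "checkbox") for kind in kinds):
        return "finite"   # assumption: only the listed values can be chosen
    return "infinite"     # assumption: free text admits unlimited values


production_year = [("textbox", ()), ("textbox", ())]
category = [("select", ("enzyme", "membrane"))]
keyword = [("textbox", ())]
```

For the Production Year attribute, `domain_type(production_year)` yields the range domain type, in line with the slide.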
Source Clustering (Example)
2. Value type: Distinct Values
• For example, the attribute Onlooker’s age or Reader age semantically has
integer values, and Production date has date values.
Source Clustering (Cont.)
3. Default Value:
• In the previous figure, the attribute Onlooker’s age has a default value
“all ages”.
4. Unit:
• One search interface may use “Milligrams/grams” as the unit of its
Concentration attribute, while another may use “Litres” for its
Concentration attribute.
Query Translation (Example)
Two Bio-Medical Data query interfaces and their matching
• Name of Bio-Medical data: Proteins
• Constraint templates:
- S1: name
- S2: category
- S3: concentration; [between; $low, $high]
- S4: onlooker’s age; [in; {[18:65],…}]
• Look at the interfaces
- T1: name
- T2: source
- T3: onlooker’s age
- T4: concentration
Query Translation (Example)
• The focus is to translate between “matching” constraint templates; for example, S2 in Q1
matches T2 in Q2.
• We need to extract the constraint templates (T1,…,T4).
• Given source and target constraint templates (from Q1 and Q2 respectively), we need
to find matching templates.
Query Translation (Example)
Constraint mapping across Query Interfaces (Q1 and Q2)
• Constraint mapping instantiates T2 into t2 = [source; all words;
"Membrane Protein"].
• This is the best translation of the source constraint s2, i.e., s2 → t2.
Query Translation (Example)
Translation rules T12 between Q1 and Q2
To translate queries we need the following mapping rules:
r1  [category; contain; $s] → emit: [source; all; $s]
r2  [name; contain; $t] → emit: [name; contain; $t]
r3  [concentration range; between; $s, $t] → $p = ChooseClosestNum($s), emit: [concentration; less than; $p]
r4  [onlooker’s age; between; $s] → $r = ChooseClosestRange($s), emit: [age; between; $r]
Table: Translation Rules
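The four rules in the table can be written out as a small function. The rule bodies follow the table; what ChooseClosestNum and ChooseClosestRange actually do is not given in the slides, so the versions here (pick the nearest value or range from a list the target interface accepts) and the two `TARGET_*` lists are assumptions for illustration.

```python
# Hypothetical values that the target interface Q2 accepts.
TARGET_CONCENTRATIONS = [10, 50, 100]
TARGET_AGE_RANGES = [(0, 17), (18, 65), (66, 120)]


def choose_closest_num(value, allowed=TARGET_CONCENTRATIONS):
    """Assumed behaviour: the allowed number nearest to `value`."""
    return min(allowed, key=lambda n: abs(n - value))


def choose_closest_range(rng, allowed=TARGET_AGE_RANGES):
    """Assumed behaviour: the allowed range whose endpoints are nearest."""
    low, high = rng
    return min(allowed, key=lambda r: abs(r[0] - low) + abs(r[1] - high))


def translate(constraint):
    """Apply rules r1 to r4 to one source constraint (attr, op, value)."""
    attr, op, value = constraint
    if attr == "category" and op == "contain":              # r1
        return ("source", "all", value)
    if attr == "name" and op == "contain":                  # r2
        return ("name", "contain", value)
    if attr == "concentration range" and op == "between":   # r3
        s, _t = value
        return ("concentration", "less than", choose_closest_num(s))
    if attr == "onlooker's age" and op == "between":        # r4
        return ("age", "between", choose_closest_range(value))
    raise ValueError(f"no translation rule for {constraint!r}")


translated = translate(("category", "contain", "Membrane Protein"))
```

Rule r3 mirrors the table literally, so only `$s` is passed to the helper; whether the lower or upper bound should drive the choice is a design decision the slides leave open.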
Query Translation (Example)
• Text type constraints: operators any, all, exact and start, with string values.
• Numeric type constraints: operators equal, greater than, less than and between, with
numeric values.
Query Translation (Example)
The constraint mapping framework
Input: a source constraint s and a target constraint template T.
• Outputs the target constraint t_opt, among those T can generate, that is closest
to s.
• The type recognizer identifies the type of the constraints, and then
dispatches them accordingly to the type handler.
Query Translation (Example)
• The type recognizer takes the source constraint s and target constraint
template T as input, and infers the data type by analyzing the constraints
syntactically.
• The type handler takes the constraints dispatched by the type recognizer
as input and searches among the possible instantiations described by T
for the best one, which is then returned as the mapping.
Query Translation (Example)
Mapping the constraints between category in Q1 and source in Q2:
• Source constraint s = [category; contain; "Membrane Protein"] is instantiated from
template S = [category; contain; $val] by populating $val = "Membrane Protein".
• Target constraint template T = [source; $op; $val] accepts operators $op from
{"any words", "all words"} and value $val from any string.
t1  [source; any; "Membrane Protein"]
t2  [source; all; "Membrane Protein"]
t3  [source; any; "Membrane"]
t4  [source; any; "Protein"]
• Among the candidate target constraints t1, t2, . . . , from I(T), the constraint
mapping thus searches for the element that is closest to the source.
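One way to make "closest" concrete is to compare what each constraint would retrieve from a small sample of values. The sketch below does that for the candidates t1 to t4. The sample data, the operator semantics in `matches`, and the use of the symmetric difference of result sets as the distance are all assumptions made for this illustration; the slides do not state how closeness is measured.

```python
# Hypothetical sample of values used to compare constraints.
SAMPLE = [
    "Membrane Protein",
    "Integral Membrane Protein",
    "Protein of the outer Membrane",
    "Membrane lipid",
    "Fibrous Protein",
]


def matches(op, value, text):
    """Assumed semantics: 'contain' is phrase containment, while
    'all' and 'any' test the individual words of the value."""
    words = value.lower().split()
    text_words = text.lower().split()
    if op == "contain":
        return value.lower() in text.lower()
    if op == "all":
        return all(w in text_words for w in words)
    if op == "any":
        return any(w in text_words for w in words)
    raise ValueError(f"unknown operator {op!r}")


def result_set(constraint):
    """Rows of the sample that satisfy the constraint."""
    _attr, op, value = constraint
    return {row for row in SAMPLE if matches(op, value, row)}


def closest(source, candidates):
    """Pick the candidate whose result set differs least from the source's."""
    wanted = result_set(source)
    return min(candidates, key=lambda t: len(wanted ^ result_set(t)))


s = ("category", "contain", "Membrane Protein")
candidates = [
    ("source", "any", "Membrane Protein"),  # t1
    ("source", "all", "Membrane Protein"),  # t2
    ("source", "any", "Membrane"),          # t3
    ("source", "any", "Protein"),           # t4
]
best = closest(s, candidates)
```

On this sample the search settles on t2, the "all words" instantiation: it returns everything the source constraint does and only one extra row, whereas the "any" candidates let in more unrelated rows.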
EXAMPLE OF AN INTERFACE
“Invention date” implies that the attribute is semantically of the date data type.
Two elements are used to specify a range query condition, with
different roles in specifying the condition.
Such semantic information is hidden from computers.
It is not defined on query interfaces.
This HIDDEN information about each attribute needs to be revealed
and defined to enrich the schema matching.
OVERVIEW OF THE SYSTEM
SOURCES ARE NOT PREDEFINED OR PRECONFIGURED,
SO THEY MUST BE FOUND DYNAMICALLY ACCORDING
TO THE USER'S AD HOC INFORMATION NEEDS.
AFTER DISCOVERY OF THE WEB DATABASES, THEIR QUERY
CAPABILITIES MUST BE EXTRACTED; THIS IS ALSO
AUTOMATIC AND ON THE FLY.
THEN, WHEN QUERYING THE SOURCES, THE QUERY IS TRANSLATED
ON THE FLY, SINCE THE SOURCES ARE UNSEEN.
OVERVIEW OF THE SYSTEM
WORK FLOW OF THE SYSTEM
BACK END → SEMANTIC DISCOVERY
- DATA CRAWLER
- Automatically collects sources from the deep web
- INTERFACE EXTRACTION
- Extracts query capabilities from interfaces
- SOURCE CLUSTERING
- Clusters interfaces into subdomains
- SCHEMA MATCHING
- Discovers semantic matchings
FRONT END → EXECUTION OF QUERY
- PROVIDE THE USER WITH A DOMAIN CATEGORY
- FOR EACH CATEGORY, A UNIFIED INTERFACE IS GENERATED BY SCHEMA MATCHING (SM)
- SOURCE SELECTION (SS) SELECTS APPROPRIATE SOURCES TO RUN THE QUERY
- THE QUERY IS TRANSLATED FOR THE SELECTED SOURCES BY QUERY TRANSLATION
- FINALLY, RESULT COMPILATION AGGREGATES THE RESULTS
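The front-end steps listed above (source selection, query translation, result compilation) can be strung together in a toy pipeline. Everything concrete here is an assumption for illustration: the `REPOSITORY` layout stands in for what the back end would have stored, the sources hold a few made-up records in memory, and translation is reduced to renaming attributes.

```python
# Hypothetical repository built by the back end: per source, its domain,
# a mapping from unified attribute names to local ones, and toy records.
REPOSITORY = {
    "Q1": {
        "domain": "protein",
        "mapping": {"name": "name", "group": "category"},
        "rows": [
            {"name": "porin", "category": "membrane"},
            {"name": "keratin", "category": "fibrous"},
        ],
    },
    "Q2": {
        "domain": "protein",
        "mapping": {"name": "name", "group": "source"},
        "rows": [
            {"name": "porin", "source": "membrane"},
            {"name": "aquaporin", "source": "membrane"},
        ],
    },
    "Q3": {"domain": "cars", "mapping": {"name": "model"}, "rows": []},
}


def select_sources(domain):
    """Source selection: keep the sources of the requested domain."""
    return [s for s, info in REPOSITORY.items() if info["domain"] == domain]


def translate_query(query, source):
    """Query translation: rename unified attributes to the source's own."""
    mapping = REPOSITORY[source]["mapping"]
    return {mapping[attr]: value for attr, value in query.items()}


def run(source, local_query):
    """Execute the translated query against the source's toy records."""
    rows = REPOSITORY[source]["rows"]
    return [r for r in rows if all(r.get(k) == v for k, v in local_query.items())]


def compile_results(per_source):
    """Result compilation: merge by name, remembering the contributing sources."""
    merged = {}
    for source, rows in per_source.items():
        for row in rows:
            merged.setdefault(row["name"], []).append(source)
    return merged


def answer(domain, query):
    per_source = {
        s: run(s, translate_query(query, s)) for s in select_sources(domain)
    }
    return compile_results(per_source)


results = answer("protein", {"group": "membrane"})
```

The user states one query against the unified attribute `group`; each source receives it under its own attribute name, and duplicates across sources collapse into a single entry that lists where it was found.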
CONCLUSION
Our target is to deploy MetaQuerier as an
efficient data integration architecture.
The implementation can be done
successfully based on biomedical data.
Inside the subsystems of MetaQuerier,
some conceptual changes can be
made to improve the efficiency of handling
huge volumes of unsorted data.
THANK YOU
ANY QUESTIONS?