BIOMEDICAL DATA INTEGRATION BASED ON METAQUERIER ARCHITECTURE

Advisor: Khondker Shajadul Hasan
Co-advisor: Javed Siddique
Group members:
• Naieem Khan
• Eusuf Abdullah Mim
• M Samiullah Chowdhury

THREE BASIC PARTS OF THE PROJECT
• Data integration
• MetaQuerier architecture
• Biomedical data

DATA INTEGRATION

What does it mean?
Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data. The need arises in both commercial settings (when two similar companies need to merge their databases) and scientific ones (combining research results from different bioinformatics repositories). Data integration appears with increasing frequency as the volume of existing data, and the need to share it, explodes.

Simple schematic for a data warehouse
The information from the source databases is extracted, transformed, and then loaded into the data warehouse.

Difficulties of data integration
• Huge web databases
• Database content is now dynamic
• Need for an efficient data crawler
• Accurate query interfaces
• Time efficiency
• Depth
• Volume of data to handle

Importance
• Integration from web databases: gathering the necessary information from different sources requires data integration.
• Large-scale integration.
• Efficient and accurate query answers.

Consider a user who is moving to a new town. To start with, different queries need different sources to answer them: Where can she look for real estate listings? (e.g., realtor.com.) Shopping for a new car? (cars.com.) Looking for a job? (monster.com.) Further, different sources support different query capabilities: after source hunting, the user must then learn the gruelling details of querying each source.
METAQUERIER ARCHITECTURE

There are different approaches and paradigms for data integration. Some of them are:
• Materialized: a physical, integrated repository is created.
• Data warehouses: physical repositories of selected data extracted from a collection of databases and other information sources.
• Mediated: data stay at the sources, and a virtual integration system is created.
• Federated and cooperative: DBMSs are coordinated to collaborate.
• Exchange: data is exported from one system to another.
• Peer-to-peer data exchange: many peers exchange data without a central control mechanism. Data is passed from peer to peer upon request, as query answers.

Two basic goals of MetaQuerier
• To make the deep Web systematically accessible: it helps users find online databases useful for their queries.
• To make the deep Web uniformly usable: it helps users query online databases.

Two basic requirements of the MetaQuerier architecture
• Dynamic discovery: because sources are changing, they must be discovered dynamically for integration. There are no preselected sources.
• On-the-fly integration: because queries are ad hoc, MetaQuerier must mediate them on the fly for the relevant sources. There are no preconfigured sources.

PARTS OF THE METAQUERIER SYSTEM
• Front end: Source Selection, Query Translation, Result Compilation
• Deep Web Repository
• Back end: Database Crawler, Interface Extraction, Source Clustering, Schema Matching

PROCESSES OF THE METAQUERIER ARCHITECTURE
• Data Crawler
• Interface Extraction
• Source Clustering
• Schema Matching
• Source Selection
• Query Translation
• Result Compilation

DATA CRAWLER
A process that gathers certain information from web databases and other resources. It is similar to web crawling, the process search engines use to search the Internet.
The Data Crawler searches data by filtering and categorizing it to make the system efficient. It has two segments:
• Site Crawler
• Shallow Crawler

DATA CRAWLER: WORKFLOW
• The Site Crawler needs an efficient query interface.
• It takes user-queryable keywords from the interface and filters the query.
• The Site Crawler goes through the root page.
• It identifies IP addresses.
• The Shallow Crawler follows the IP addresses that were found.

DATA CRAWLER: ADVANTAGES
Data crawling addresses two major challenges:
• Dynamic discovery, covered by the Site Crawler.
• Deep web searching, covered by the Shallow Crawler.

INTERFACE EXTRACTION
Interface Extraction (IE) extracts the required data from query interfaces. Query interfaces usually share similar query patterns, but the patterns sometimes differ. Different query patterns arise from hidden information or attributes that are not visible on the interface.
Workflow:
• The Data Crawler hands over a large amount of unsorted and hidden data.
• IE generates a query that extracts the found data.

INTERFACE EXTRACTION: KEY FEATURES
• It takes query interfaces in HTML format.
• It then functions as a visual language parser.
• It tokenizes the page, parses the tokens, and then merges potentially multiple parse trees.
• Finally, it generates the query capabilities.
The basic idea of interface extraction is to extract query capabilities from query interfaces.

SOURCE SELECTION
Having defined a common mediated schema for all data sources, we need to match and map the data sources to that mediated schema. The target users may understand the concepts in their own domain but may not know those of other domains.
The solution is to decide which sources to include in the data integration and which mediated schema to use. All ontologies are stored in a common repository. The system identifies which ontology to use based on the query the user submits.

RESULT COMPILATION
• The last process of the data integration.
• It aggregates query results for the user.
• It compiles results from different sources into coherent pieces.
• It will be used for extracting data from schema matching and for matching other attributes across different sources.

SOURCE CLUSTERING
• Collaborates with Source Selection, which works in the front end.
• Clusters sources according to subject domain (e.g. edu, org, etc.).
• Sorts data as an intermediate step that feeds the schema matching process.
• Its main task is to construct a hierarchy of clusters from a given set of query capabilities.

SOURCE CLUSTERING (CONT.): CHARACTERISTICS OF DOMAIN ELEMENTS AND CONSTRAINT ELEMENTS
• Textboxes cannot be used as constraint elements.
• Radio buttons, checkboxes, or selection lists may appear as constraint elements.
• An attribute consisting of a single element cannot have constraint elements.
• An attribute consisting of only radio buttons or checkboxes does not have constraint elements.

SOURCE CLUSTERING (CONT.): HOW TO DIFFERENTIATE BETWEEN DOMAIN AND CONSTRAINT ELEMENTS
A simple two-step method can be used:
1. First, identify the attributes that contain only one element, or whose elements are all radio buttons, all checkboxes, or all textboxes.
2. Second, use an Element Classifier to process the other attributes, which may contain both domain elements and constraint elements. Each element is represented by four features: element name, element format type, element relative position in the element list, and element values.
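Step 1 of the two-step method above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Element` and `Attribute` classes, the format-type strings, and the function names are hypothetical, and the sketch assumes that "all radio buttons, or checkboxes or textboxes" means all elements share one of those three format types. The Element Classifier of step 2 is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """One form element, described by the four features named on the slide."""
    name: str
    format_type: str            # assumed labels: "textbox", "radio", "checkbox", "select"
    position: int               # relative position in the attribute's element list
    values: Tuple[str, ...] = ()


@dataclass
class Attribute:
    label: str
    elements: List[Element]


# Format types that, when used uniformly, imply domain elements only (assumption).
SIMPLE_FORMATS = {"radio", "checkbox", "textbox"}


def has_only_domain_elements(attr: Attribute) -> bool:
    """Step 1: single-element attributes, or attributes whose elements all
    share one of the simple format types, have no constraint elements."""
    if len(attr.elements) == 1:
        return True
    formats = {e.format_type for e in attr.elements}
    return len(formats) == 1 and formats <= SIMPLE_FORMATS


def split_attributes(attrs: List[Attribute]):
    """Separate attributes settled by step 1 from those that still need
    the Element Classifier of step 2."""
    domain_only = [a for a in attrs if has_only_domain_elements(a)]
    needs_classifier = [a for a in attrs if not has_only_domain_elements(a)]
    return domain_only, needs_classifier
```

For example, an attribute made of two textboxes is settled by step 1, while a textbox combined with a group of radio buttons is passed on to the classifier, because the radio buttons may be constraint elements.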
SOURCE CLUSTERING (CONT.): DERIVING INFORMATION FROM ATTRIBUTES
Four types of information are defined for each attribute (for domain elements only):
1. Domain type: indicates how many distinct values can be used for an attribute in queries. Four domain types are defined in our model:
   • range
   • finite
   • infinite
   • Boolean
2. Value type: each attribute on a search interface has its own semantic value type.
   • All input values are treated as text values.
3. Default value:
   • Indicates some semantics of the attribute.
   • May occur in a selection list, a group of radio buttons, or a group of checkboxes.
   • Always marked as “checked” or “selected”.
4. Unit:
   • Defines the meaning of an attribute value; e.g., kilogram is a unit for weight.

SCHEMA MATCHING
• A schema defines the tables, the fields in each table, and the relationships between fields and tables.
• It is the graphical representation of a database structure.
• Schema matching is the process of identifying whether two objects are semantically related, while mapping refers to the transformations between the objects.
• In data integration, schema matching discovers the semantic matchings among the attributes that have been found through query interfaces.

SCHEMA MATCHING (CONT.)
• Uses data from the query capabilities and organizes it as required.
• Provides data to Source Selection and Query Translation, which finally deliver the data to users at the front end.
• MetaQuerier approaches the process as complex matching rather than matching one pair at a time.

QUERY TRANSLATION
• A front-end process.
• Translation is necessary to match and express query conditions in terms of what an interface supports.
• It is critical to interpret queries automatically.
Steps for complete query translation:
• Step 1: extract constraint templates from a query interface.
• Step 2: given source and target constraint templates, find the matching templates.

QUERY TRANSLATION (CONT.)
Constraint mapping:
• The objective is to find the target constraint with the closest semantic meaning to the source constraint.
Query mediation:
• Mediates queries across multiple sources.
• Abstracts the problem as an instance of answering queries using views.
• The focus is to decompose a user query into sub-queries across multiple sources.
Schema mapping:
• Translates a set of data values from one source to another according to a given matching.
• Is concerned only with the equality relation between different schemas.

BIOMEDICAL DATA: PROTEIN
What is protein?
Any of a large group of nitrogenous organic compounds that are essential constituents of living cells; they consist of polymers of amino acids, are essential in the diet of animals for growth and for repair of tissues, and can be obtained from meat, eggs, milk, and legumes.
• Type of: macromolecule (supermolecule)
• Made up of: amino acids (aminoalkanoic acids), polypeptides

BIOMEDICAL DATA: PROTEIN
Some examples of protein information

BIOMEDICAL DATA: PROTEIN
Available web services about protein

SOURCE CLUSTERING (EXAMPLE): DERIVING INFORMATION FROM ATTRIBUTES
1. Domain type: range, finite, infinite, and Boolean
   • Here, two textboxes are used to represent a range for the attribute Production Year, so the attribute should have the range domain type.
2. Value type: distinct values
   • For example, the attribute Onlooker’s age (or Reader age) semantically has integer values, and Production date has date values.
3. Default value:
   • In the previous figure, the attribute Onlooker’s age has the default value “all ages”.
4. Unit:
   • One search interface may use “Milligrams/grams” as the unit of its Concentration attribute, while another may use “Litres” for its Concentration attribute.

QUERY TRANSLATION (EXAMPLE)
Two biomedical data query interfaces and their matching
• Name of the biomedical data: Proteins
• Constraint templates of the source interface Q1:
   • S1: name
   • S2: category
   • S3: concentration; [between; $low, $high]
   • S4: onlooker’s age; [in; {[18:65],…}]
• Constraint templates of the target interface Q2:
   • T1: name
   • T2: source
   • T3: onlooker’s age
   • T4: concentration

QUERY TRANSLATION (EXAMPLE, CONT.)
With the templates S1 to S4 and T1 to T4 listed above:
• The focus is to translate between “matching” constraint templates; for example, S2 in Q1 matches T2 in Q2.
• We need to extract the constraint templates (T1,…,T4).
• Given source and target constraint templates (from Q1 and Q2 respectively), we need to find the matching templates.

QUERY TRANSLATION (EXAMPLE): CONSTRAINT MAPPING ACROSS QUERY INTERFACES (Q1 AND Q2)
• Constraint mapping instantiates T2 into t2 = [source; all words; "Membrane Protein"].
• This is the best translation of the source constraint s2, i.e., s2 → t2.

QUERY TRANSLATION (EXAMPLE): TRANSLATION RULES T12 BETWEEN Q1 AND Q2
To translate queries we need the following mapping rules:
• r1: [category; contain; $s] → emit: [source; all; $s]
• r2: [name; contain; $t] → emit: [name; contain; $t]
• r3: [concentration range; between; $s, $t] → $p = ChooseClosestNum($s), emit: [concentration; less than; $p]
• r4: [onlooker’s age; between; $s] → $r = ChooseClosestRange($s), emit: [age; between; $r]
Table: Translation Rules

QUERY TRANSLATION (EXAMPLE): CONSTRAINT TYPES
• Text type constraints: operators any, all, exact, start, and string values.
• Numeric type constraints: operators equal, greater than, less than, between, and numeric values.
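The translation rules r1 to r4 above can be sketched as a small rule table. This is an illustrative sketch only: the constraint tuples, the option lists `TARGET_CONCENTRATIONS` and `TARGET_AGE_RANGES`, and the distance measures inside `choose_closest_num` and `choose_closest_range` are assumptions made for the example; the slides name ChooseClosestNum and ChooseClosestRange but do not define them.

```python
from typing import List, Tuple

# A constraint is modelled as (attribute, operator, value); a modelling assumption.
Constraint = Tuple[str, str, object]

# Hypothetical values offered by the target interface Q2.
TARGET_CONCENTRATIONS: List[float] = [10, 50, 100, 500]
TARGET_AGE_RANGES: List[Tuple[int, int]] = [(0, 17), (18, 65), (66, 120)]


def choose_closest_num(x: float, options: List[float] = TARGET_CONCENTRATIONS) -> float:
    """Pick the target number with the smallest absolute distance to x."""
    return min(options, key=lambda o: abs(o - x))


def choose_closest_range(rng: Tuple[int, int],
                         options: List[Tuple[int, int]] = TARGET_AGE_RANGES) -> Tuple[int, int]:
    """Pick the target range whose endpoints are closest to the source range."""
    lo, hi = rng
    return min(options, key=lambda o: abs(o[0] - lo) + abs(o[1] - hi))


def translate(constraint: Constraint) -> Constraint:
    """Apply the rule (r1 to r4) that matches the source constraint."""
    attr, op, value = constraint
    if attr == "category" and op == "contain":              # r1
        return ("source", "all", value)
    if attr == "name" and op == "contain":                  # r2
        return ("name", "contain", value)
    if attr == "concentration range" and op == "between":   # r3
        s, _t = value                                       # the slide's r3 uses $s
        return ("concentration", "less than", choose_closest_num(s))
    if attr == "onlooker's age" and op == "between":        # r4
        return ("age", "between", choose_closest_range(value))
    raise ValueError(f"no translation rule for {constraint!r}")
```

Calling `translate(("category", "contain", "Membrane Protein"))` applies r1 and yields `("source", "all", "Membrane Protein")`, which corresponds to the mapping s2 → t2 in the example.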
QUERY TRANSLATION (EXAMPLE): THE CONSTRAINT MAPPING FRAMEWORK
Input: a source constraint s and a target constraint template T.
• Output: the closest target constraint t_opt that T can generate for s.
• The type recognizer takes s and T as input, infers the data type by analyzing the constraints syntactically, and dispatches the constraints to the corresponding type handler.
• The type handler takes the dispatched constraints as input and searches among the possible instantiations of T for the best one, which is then returned as the mapping.

QUERY TRANSLATION (EXAMPLE): MAPPING CATEGORY IN Q1 TO SOURCE IN Q2
• Source constraint s = [category; contain; "Membrane Protein"] is instantiated from template S = [category; contain; $val] by populating $val = "Membrane Protein".
• Target constraint template T = [source; $op; $val] accepts operators $op from {"any words", "all words"} and a value $val from any string.
Candidate target constraints:
• t1 = [source; any; “Membrane Protein”]
• t2 = [source; all; “Membrane Protein”]
• t3 = [source; any; “Membrane”]
• t4 = [source; any; “Protein”]
• Among the candidate target constraints t1, t2, ... from I(T), the set of instantiations of T, constraint mapping searches for the one that is closest to the source constraint.

EXAMPLE OF AN INTERFACE
• “Invention date” implies that the attribute is semantically of a date data type.
• Two elements are used to specify a range query condition, each with a different role in specifying the condition.
• Such semantic information is hidden from computers: it is not defined on query interfaces.
• This hidden information about each attribute needs to be revealed and defined to enrich schema matching.
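The constraint mapping framework described earlier (type recognizer, type handler, search over the instantiations I(T)) can be sketched for the category-to-source example. This is a hedged sketch under stated assumptions: the slides do not say how closeness is measured, so the code compares the sets of records each constraint would match on a small, hypothetical `SAMPLE` list using Jaccard similarity. All function names are illustrative, and only a text handler is shown.

```python
from itertools import product

# Hypothetical sample values, used only to compare what constraints match.
SAMPLE = [
    "Membrane Protein",
    "Integral Membrane Protein",
    "Membrane Lipid",
    "Protein Kinase",
    "Protein of the outer membrane",
]


def recognize_type(constraint):
    """Type recognizer: a purely syntactic check on the constraint value."""
    _attr, _op, value = constraint
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "text"


def matches(op, value, text):
    """Assumed operator semantics: phrase containment, any word, all words."""
    words = set(text.lower().split())
    query = value.lower().split()
    if op == "contain":
        return value.lower() in text.lower()
    if op == "any":
        return any(w in words for w in query)
    if op == "all":
        return all(w in words for w in query)
    raise ValueError(f"unknown operator: {op}")


def result_set(op, value):
    """Indices of the sample records that satisfy the constraint."""
    return {i for i, text in enumerate(SAMPLE) if matches(op, value, text)}


def text_handler(source, target_attr, target_ops):
    """Type handler for text: enumerate candidate instantiations of the
    target template and return the one closest to the source constraint."""
    _attr, op, value = source
    wanted = result_set(op, value)
    values = [value] + value.split()   # full phrase plus its individual words

    def closeness(candidate):
        got = result_set(candidate[1], candidate[2])
        union = wanted | got
        return len(wanted & got) / len(union) if union else 1.0

    candidates = [(target_attr, o, v) for o, v in product(target_ops, values)]
    return max(candidates, key=closeness)


HANDLERS = {"text": text_handler}      # a numeric handler is left out of this sketch


def map_constraint(source, target_attr, target_ops):
    """Dispatch the source constraint to the handler for its recognized type."""
    kind = recognize_type(source)
    if kind not in HANDLERS:
        raise NotImplementedError(f"no handler for type {kind!r}")
    return HANDLERS[kind](source, target_attr, target_ops)
```

With this sample, `map_constraint(("category", "contain", "Membrane Protein"), "source", ["any", "all"])` selects the "all words" candidate with the full phrase, in line with t2 in the example. A different sample or closeness measure could rank the candidates differently.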
OVERVIEW OF THE SYSTEM
• Sources are not predefined or preconfigured, so they must be found dynamically according to the user's ad hoc information needs.
• After the web databases are discovered, their query capabilities must be extracted, automatically and on the fly.
• When querying the sources, queries must be translated on the fly, since the sources are unseen.

WORKFLOW OF THE SYSTEM
Back end: semantic discovery
- Data Crawler: automatically collects sources from the deep web
- Interface Extraction: extracts query capabilities from interfaces
- Source Clustering: clusters interfaces into subdomains
- Schema Matching: discovers semantic matchings
Front end: execution of the query
- Provide the user with domain categories
- For each category, a unified interface is generated by Schema Matching (SM)
- Select appropriate sources to run the query (Source Selection, SS)
- The query is translated for the selected sources by Query Translation
- Finally, the results are aggregated by Result Compilation

CONCLUSION
• Our target is to deploy MetaQuerier as an efficient data integration architecture.
• The implementation can be done successfully based on biomedical data.
• Within the MetaQuerier subsystems, some conceptual changes can be made to improve the efficiency of handling huge volumes of unsorted data.

THANK YOU
ANY QUESTIONS?
