Dvoy Database Ideas
• Heterogeneous to homogeneous
• Homogenization by applying a uniform schema: the multidimensional data model
• User queries are directed toward the n-dimensional data cubes, generally for extraction of data slices along specific dimensions.
• The slices represent data views, e.g. map view, time view
– For relational data, homogenization is by adding view-specific queries
– For unstructured data, metadata is added to describe the structure, so the agent can retrieve the data for views

Multi-Dimensional Data Access
[Figure: data granule selection from 1D, 2D and 3D datasets, e.g. a Time selector; a Param & Time selector; a Param, Time & Location selector]
• In array notation, the granule value is accessed as
– MyGranule = My1DArray(i)
– MyGranule = My2DArray(i,j)
– MyGranule = MynDArray(i,j,...,n)
• In order to select a data granule, a controller is assigned to each data dimension

Multi-Dimensional Data Model
[Figure: 1-, 2- and 3-dimensional data spaces, e.g. Time; Location & Time; Location, Time & Parameter]
• Data can be distributed over 1, 2, ..., n dimensions
• Views are orthogonal slices through multidimensional data cubes
• Spatial and temporal slices through the data are most common

Database Schema Design
• Fact Table: A fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method.
• Star Schema: a central fact table surrounded by de-normalized dimensional tables (blue) describing the sites, parameters, methods, etc.
• Snowflake Schema: an extension of the star schema where each point of the star 'explodes' into further fully normalized tables, expanding the description of each dimension.
• The snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data into a robust relational database.
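The array-notation granule access above can be sketched in a few lines of Python. This is a minimal illustration, not DVoy code; the function and variable names (select_granule, controllers) are hypothetical.

```python
# Sketch of index-based granule selection: each dimension gets a
# "controller" (an index), and the granule is the value at those indices.

def select_granule(dataset, controllers):
    """Walk one nested-list level per dimension controller."""
    granule = dataset
    for index in controllers:
        granule = granule[index]
    return granule

# 1D dataset (e.g. time series): one controller, like My1DArray(i)
series = [0.1, 0.4, 0.9]
print(select_granule(series, [2]))        # 0.9

# 2D dataset (e.g. param x time): two controllers, like My2DArray(i,j)
table = [[1, 2], [3, 4]]
print(select_granule(table, [1, 0]))      # 3

# 3D dataset (e.g. param x time x location): three controllers
cube = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
print(select_granule(cube, [0, 1, 1]))    # 8
```

One controller per dimension is exactly what the slide describes: selecting a granule means fixing an index along every axis of the data cube.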
IMPROVE Relational Database Schema (Tentative)
• For the IMPROVE data, an extended star schema captures the needed information
• The dimensional tables (Site, Parameter, Method) are partly normalized (further broken into relational tables)

Snowflake Example: Central California AQ Study, CCAQS
• The CCAQS schema incorporates a rich set of parameters needed for QA/QC (e.g. sample tracking) as well as for data analysis.
• The fully relational CCAQS schema permits the enforcement of integrity constraints, and it has been demonstrated to be useful for data entry/verification.
• However, no two snowflakes are identical. Similarly, a rich snowflake schema built for one sampling/analysis environment cannot be easily transplanted elsewhere.
• More importantly, many of the recorded parameters 'on the fringes' are not particularly useful for integrative, cross-Supersite, regional analyses.

Catalog of Multidimensional Datasets
• Designed to Publish, Find (Select), Bind (Access) and Render distributed datasets
• Publishing is through an open web interface for user/broker registration of datasets
• Finding a dataset is aided by metadata on the Provider and Dataset
• Binding (data access) information is contained in the dimensional tables
• Rendering parameters are also contained in the dimensional tables

Minimal Star Schema for Spatio-Temporal Data
• In order to respond to data queries by time, location and parameter, the database has to have time, location and parameter as dimensions
• The minimal Site table includes SiteID, Name and Lat/Lon
• The minimal Parameter table consists of ParameterID, Description and Unit
• The time dimension table is usually skipped, since time is self-describing
• The minimal Fact (Data) table consists of the observation value and the three dimensional codes for DateTime, Site and Parameter
• The above minimal multidimensional schema was used in the CAPITA data exploration software, Voyager.
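The minimal star schema above can be sketched concretely with SQLite. This is an illustration of the schema shape, not the actual SRDS database; the table and column names follow the slide's description but are otherwise assumptions.

```python
# Minimal star schema sketch: Site and Parameter dimension tables plus a
# Fact table keyed by DateTime, SiteID and ParameterID.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Site      (SiteID INTEGER PRIMARY KEY, Name TEXT, Lat REAL, Lon REAL);
CREATE TABLE Parameter (ParameterID INTEGER PRIMARY KEY, Description TEXT, Unit TEXT);
CREATE TABLE Fact      (ObsDateTime TEXT, SiteID INTEGER, ParameterID INTEGER,
                        ObsValue REAL);
""")
con.execute("INSERT INTO Site VALUES (1, 'St. Louis', 38.63, -90.2)")
con.execute("INSERT INTO Parameter VALUES (10, 'PM2.5 mass', 'ug/m3')")
con.execute("INSERT INTO Fact VALUES ('2001-11-27', 1, 10, 14.2)")

# A spatial query: fix parameter and time, return values with site coords
rows = con.execute("""
    SELECT s.Lon, s.Lat, s.SiteID, f.ObsValue
    FROM Fact f JOIN Site s ON f.SiteID = s.SiteID
    WHERE f.ParameterID = 10 AND f.ObsDateTime = '2001-11-27'
""").fetchall()
print(rows)   # [(-90.2, 38.63, 1, 14.2)]
```

The query at the end is the spatial slice the slides describe: time and parameter fixed, location varying.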
From Heterogeneous to Homogeneous Schema
• Individual Supersite SQL databases can be queried along spatial, temporal and parameter dimensions. However, the query to retrieve the same information depends on the schema of the particular database.
• A way to homogenize the distributed data is to access all the data through a Data Adapter, using only a subset of the tables/fields from any particular database (red).
• The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded). The final form of the uniformly extracted data schema will be arrived at by consensus.
[Figure: extraction of homogeneous data from heterogeneous sources; the Data Wrapper maps the subset used onto the Uniform Schema Fact Table]

Remote Data Access Configuration for DATAFED
[Figure: wrapper software behind a firewall receives a CGI or Web Service call, fills the SQL query template ([[param_list]], [[filter]]), calls the SQL server, and returns the dataset to Dvoy as XML or CSV]
• The connection between DATAFED and a remote server can be either through Web Services or a CGI call.

Spatial Query Template for DVoy

    SELECT SiteTable.Longitude * (180.0/PI()) AS Lon,
           SiteTable.Latitude * (180.0/PI()) AS Lat,
           SiteTable.Site_ID, [[param_list]]
    FROM DataFactTable RIGHT OUTER JOIN SiteTable
         ON DataFactTable.Site_ID = SiteTable.Site_ID
    WHERE [[filter]]
    ORDER BY Lon ASC

• Default [[param_list]] – the list of fields to be extracted from DataFactTable: DataFactTable.[[param_abbr]] AS VALUE
• Default [[filter]] (for a spatial query, datetime_min = datetime_max): (DataFactTable.Date_Time BETWEEN '[[datetime_min]]' AND '[[datetime_max]]')
• Items in red brackets [[item]] are replaced by the code at run time.
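The run-time [[item]] substitution can be sketched as a simple two-pass string fill. This is an illustrative sketch, not the DATAFED wrapper code; fill_template and the default values are assumptions.

```python
# Sketch of filling a [[item]]-style SQL query template at run time.

def fill_template(template, **params):
    """Replace each [[name]] placeholder with its supplied value."""
    for name, value in params.items():
        template = template.replace("[[%s]]" % name, value)
    return template

template = (
    "SELECT SiteTable.Site_ID, [[param_list]] "
    "FROM DataFactTable JOIN SiteTable "
    "ON DataFactTable.Site_ID = SiteTable.Site_ID "
    "WHERE [[filter]]"
)

# First pass: insert the field list and the filter expression
sql = fill_template(
    template,
    param_list="DataFactTable.PM25 AS VALUE",
    filter=("DataFactTable.Date_Time BETWEEN "
            "'[[datetime_min]]' AND '[[datetime_max]]'"),
)
# The default filter itself contains placeholders, filled in a second pass
sql = fill_template(sql, datetime_min="2001-11-27", datetime_max="2001-11-27")
print(sql)
```

For a spatial query, datetime_min equals datetime_max, so the BETWEEN clause selects a single time step across all sites.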
• SiteTable and DataFactTable names are to be replaced by the actual table names
• The JOIN can be an inner join for some tables

Purpose of the Supersite Relational Database System
Design, populate and maintain a database which:
– Includes monitoring data from Supersites and auxiliary projects
– Facilitates cross-Supersite (regional or comparative) data analyses
– Supports the analyses by a variety of research groups

Stated Features of the Relational Data System
• Data Input:
– Data input electronically through FTP or Web browser (CD, if necessary)
– Modest amount of metadata on sites, instruments, data sources/versions, contacts, etc.
– Data structures, formats and submission procedures simple for the submitters
• Data Storage and Maintenance:
– Data stored in relational database(s), possibly distributed over multiple servers
– Maintenance of a data holdings catalog and request logs
– Data updates quarterly
• Data Access:
– Access method: user-friendly web access by multiple authorized users
– Data finding: metadata catalog of datasets
– Data query: by parameter, method, location, date/time, or other metadata
– Data output formats: ASCII, spreadsheet, other (dbf, XML)

Snowflake Example: Central California AQ Study, CCAQS
The CCAQS provides a good example of a snowflake relational schema.
• It incorporates a rich set of parameters needed for QA/QC as well as for data analysis
• The 'branches' and 'leaves' contain tables for QA/QC and for detailed dimensional metadata
• The fully relational snowflake schema permits the enforcement of integrity constraints
• However, it is hard to apply an elaborate snowflake schema to other sampling/analysis conditions

Data Preparation Procedures
• Data gathering, QA/QC and standard formatting are to be done by the individual projects
• The data exchange standards, data ingest and archives are by ORNL and NASA
• Data ingest is to be automated, aided by tools and procedures supplied by this project:
– NARSTO DES-SQL translator
– Web submission tools and procedures
– Metadata catalog and I/O facilities
• Data submissions and access will be password protected, as set by the community.
• Submitted data will be retained in a temporary buffer space and, following verification, transferred to the shared SQL database.
• Data access, submissions, etc. will be automatically recorded and summarized

Database Schema Design for the Federated Data Warehouse
Nov 27, 2001, RBH

Data Warehouse Features
• As much as possible, data should reside in their respective home environments. 'Uprooted' data in decoupled databases tend to decay, i.e. they cannot easily be updated, maintained or enriched.
• Abstract (universal) query/retrieval facilitates integration and comparison along the key dimensions (space, time, parameter, method)
• The open-architecture data warehouse (based on Web Services) promotes the building of further value chains: data viewers, data integration programs, automatic report generators, etc.

Proposal Presentation to the Supersite Program, Nov 30, 2001
Supersite Relational Database System (SRDS), a sub-project of the St. Louis Midwest Supersite Project; Jay Turner, PI
[Figure: SQL data servers; the Data Adapter extracts the subset used into the Uniform Schema Fact Table for the user]
Rudolf Husar, PI, Center for Air Pollution Impact and Trend Analysis (CAPITA), Washington University, St.
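The buffer-then-transfer submission flow described above can be sketched with a staging table that is emptied into the shared table only after verification. This is an illustrative sketch under assumed names; the actual SRDS verification rules are not specified here, so a trivial range check stands in for them.

```python
# Sketch of buffered data submission: rows land in a temporary staging
# table and move to the shared table only after passing verification.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Staging (SiteID INTEGER, ObsDateTime TEXT, ObsValue REAL);
CREATE TABLE Shared  (SiteID INTEGER, ObsDateTime TEXT, ObsValue REAL);
""")
# A submission arrives in the buffer space
con.executemany("INSERT INTO Staging VALUES (?, ?, ?)",
                [(1, '2001-11-27', 14.2), (1, '2001-11-28', -9.0)])

# Verification step (here: reject negative concentrations), then transfer
con.execute("INSERT INTO Shared SELECT * FROM Staging WHERE ObsValue >= 0")
con.execute("DELETE FROM Staging WHERE ObsValue >= 0")

print(con.execute("SELECT COUNT(*) FROM Shared").fetchone()[0])   # 1 accepted
print(con.execute("SELECT COUNT(*) FROM Staging").fetchone()[0])  # 1 held back
```

Rejected rows stay in the buffer for correction and resubmission rather than contaminating the shared database.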
Louis, MO

Purpose of the Project: Design, Populate and Maintain a Supersite Relational Database System
• Facilitate cross-Supersite, regional, comparative data analyses
• Support analyses by a variety of research groups
• Include monitoring data from Supersites and auxiliary projects

EPA Specs for the Supersite Relational Data System (from the RFP)
• Data Input:
– Data input electronically
– Modest amount of metadata on sites, instruments, data sources/versions, contacts, etc.
– Simple data structures and formats, and convenient submission procedures
• Data Storage and Maintenance:
– Data storage in relational database(s), possibly distributed over multiple servers
– A catalog of data holdings and request logs
– Supersite data updates quarterly
• Data Access:
– User-friendly web access by multiple authorized users
– Data query by parameter, method, location, date/time, or other metadata
– Multiple data output formats (ASCII, spreadsheet, other: dbf, XML)

General Approach to SRDS Design
• Based on consensus, adopt a uniform relational data structure suitable for regional and cross-Supersite data integration and analysis.
• We propose a star schema with spatial, temporal, parameter and method dimensions.
• The 'original' data are to be maintained by the respective providers or custodians (Supersites, CIRA, CAPITA...).
• We propose the creation of flexible 'adapters' and web-submission forms for the transfer of data subsets into the uniformly formatted 'Federated Data Warehouse'.
• Data users would access the data warehouse manually or through software.
• We propose data access using the modern 'web services' protocol, suitable for adding data viewers, processors (filtering, aggregation and fusion) and other value-adding processes.

From Heterogeneous to Homogeneous Schema
• Individual SQL databases have varied designs, usually following a more elaborate 'snowflake' pattern (see Database Schema Design for the Federated Data Warehouse).
• Though they have more complicated schemata, these Supersite SQL servers can be queried along the spatial, temporal, parameter and method dimensions. However, the query to retrieve the same information depends on the particular database schema.
• A way to homogenize the distributed data is to access all the data through a Data Wrapper which uses only a subset of the tables/fields from any particular database (shown red in the schemata below).
• The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded).
• The final form of the uniformly extracted data schema will be arrived at by Supersite consensus.
[Figure: extraction of homogeneous data from heterogeneous sources; the Data Adapter maps the subset used onto the Uniform Schema Fact Table]

On-line Analytical Processing: OLAP
• A multidimensional data model making it easy to select, navigate, integrate and explore the data.
• An analytical query language providing the power to filter, aggregate and merge data, as well as explore complex data relationships.
• The ability to create calculated variables from expressions based on other variables in the database.
• Pre-calculation of frequently queried aggregate values, e.g. monthly averages, enables fast response times to ad hoc queries.

Fast Analysis of Shared Multidimensional Information (FASMI) (Pendse, N., "The OLAP Report")
An OLAP system is characterized as:
• being Fast – designed to deliver relevant data to users quickly and efficiently; suitable for 'real-time' analysis
• facilitating Analysis – users can extract not only 'raw' data but also data they 'calculate' on the fly
• being Shared – the data and its access are distributed
• being Multidimensional – the key feature: the system provides a multidimensional view of the data
• exchanging Information – the ability to disseminate large quantities of various forms of data and information
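The pre-calculation of aggregates mentioned above can be sketched as a roll-up along the time hierarchy: daily observations are averaged into a monthly table once, so later queries read the aggregate instead of rescanning the raw data. The data and names below are illustrative.

```python
# Sketch of OLAP-style pre-aggregation: daily values rolled up to
# monthly averages along the time hierarchy (day -> month).
from collections import defaultdict

daily = {  # (site, "YYYY-MM-DD") -> observed value
    ("STL", "2001-11-01"): 10.0,
    ("STL", "2001-11-02"): 14.0,
    ("STL", "2001-12-01"): 20.0,
}

sums = defaultdict(lambda: [0.0, 0])
for (site, day), value in daily.items():
    month = day[:7]                      # truncate day to its month
    sums[(site, month)][0] += value
    sums[(site, month)][1] += 1

monthly = {key: total / n for key, (total, n) in sums.items()}
print(monthly[("STL", "2001-11")])       # 12.0
```

The same pattern extends up the hierarchy (month to season to year): each level is computed once from the level below it.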
Multi-Dimensional Data Cubes
• Multi-dimensional data models use inherent relationships in the data to populate multidimensional matrices called data cubes.
• A cube's data can be queried using any combination of dimensions.
• Hierarchical data structures are created by aggregating the data along successively larger ranges of a given dimension; e.g. the time dimension can contain the aggregates year, season, month and day.

Conceptual Data Cube: Measure, Data Points and Data Cube
• A measure (in OLAP terminology) represents the numerical values for a specific entity to be analyzed (e.g. temperature, wind speed, pollutant).
• A collection of measures forms a special dimension, 'Measures'. (??Can Measures be Dimensions??)
• A measure has a set of discrete data granules – atomic data entities that cannot be further broken down.
• All data points in a measure represent the same measured parameter, e.g. temperature. Hence, they share the same units and dimensionality.
• The data points of a measure are enclosed in a conceptual multidimensional data cube; each data point occupies a volume (slice or point) in the data cube.
• Data points in a measure share the same dimensions. Conversely, each data point has to have dimensional coordinates in the data cube of the measure that it belongs to.
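Querying a cube "using any combination of dimensions" can be sketched by keying each granule with its full dimensional coordinates and letting a query fix any subset of them. The structure and names below are illustrative, not DVoy code.

```python
# Sketch of slicing a data cube: each granule is keyed by its
# (parameter, site, time) coordinates; a query fixes any subset of them.

cube = {
    ("PM25", "STL", "2001-11-27"): 14.2,
    ("PM25", "STL", "2001-11-28"): 16.8,
    ("PM25", "ATL", "2001-11-27"): 11.0,
    ("O3",   "STL", "2001-11-27"): 41.0,
}

def slice_cube(cube, parameter=None, site=None, time=None):
    """Return granules matching every coordinate that is fixed (non-None)."""
    want = (parameter, site, time)
    return {coords: v for coords, v in cube.items()
            if all(w is None or w == c for w, c in zip(want, coords))}

# Spatial slice ("map view"): fix parameter and time, vary site
print(slice_cube(cube, parameter="PM25", time="2001-11-27"))
# Temporal slice ("time view"): fix parameter and site, vary time
print(slice_cube(cube, parameter="PM25", site="STL"))
```

The two calls correspond directly to the orthogonal map and time views described earlier: each fixes all dimensions but one.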
DVoy Data Space
[Figure: the data space for a measure, bounded by Xmin..Xmax, Ymin..Ymax, Zmin..Zmax; data views are sub-spaces of the measure's data space, and data granules are atomic volumes within it]

DVoy: Components and Image Data Flow for Distributed Image Data
[Figure: distributed image data (e.g. Elevation, TOMS and SeaWiFS Web Services) registered in a data catalog; a data query for a measure selects the default cube (DX, DY, DZ, DT), and the image delivery service feeds the image data browser (XY map with Z, T fixed) over HTTP/FTP]
Dvoy components:
• Distributed Data
• Image Registration
• Data Catalog
• Data Query
• Image Delivery
• Image Viewer

Distributed Voyager: Safe Connection for Remote Data (SQL) Servers
R. Husar and K. Hoijarvi, CAPITA, October 24, 2003

DVoy Data Space, Data Set, Data Instance
• Data Set: a subset of the data space along X, Y, Z, T; e.g. the TOMS images, 1996-2001
• Data Instance: a subset of a data set, e.g. a single TOMS image; the atomic data entity that cannot be broken down into smaller entities. It is a record in the database, e.g.:
– a data point: Temperature(xi, yi, zi, ti)
– an image: Temperature(xrange, yrange, ti)
• Data Set: a collection of data instances
• The parameter (or 'variable', or the OLAP 'measure') is NOT a dimension. Rather, each parameter covers a multidimensional space.
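The containment hierarchy above (a measure's data space holds data sets, which hold atomic data instances) can be sketched as plain data classes. The class names mirror the slide's terms, but the attributes and example values are assumptions for illustration only.

```python
# Sketch of the DVoy containment hierarchy: Measure > DataSet > DataInstance.
from dataclasses import dataclass, field

@dataclass
class DataInstance:        # atomic record: one data point or one image
    coords: tuple          # (x, y, z, t) coordinates in the measure's cube
    payload: object        # e.g. a value, or image bytes

@dataclass
class DataSet:             # subset of the data space, e.g. TOMS images 96-01
    name: str
    instances: list = field(default_factory=list)

@dataclass
class Measure:             # one parameter; not a dimension, but a full cube
    name: str
    datasets: list = field(default_factory=list)

toms = DataSet("TOMS Images 1996-2001",
               [DataInstance((None, None, None, "1996-01-01"), "image-bytes")])
aerosol = Measure("Aerosol Index", [toms])
print(len(aerosol.datasets[0].instances))   # 1
```

Note that the measure sits above the dimensions rather than among them, matching the slide's point that a parameter is not a dimension but covers a multidimensional space.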
Service Chaining in the Spatio-Temporal Data Browser
[Figure: service chain from data sources (SQL tables, GIS data, data cubes, satellite images) through wrapper/mediator homogenizers and a catalog to OGC-compliant GIS services (spatial portrayal, spatial overlay) and time-series services (time portrayal, time overlay), ending at the client browser with a cursor/controller: maintain data, find/bind data, portray/overlay, render]

Data Entry to the Supersite Relational Data System
1. Batch transfer of large Supersite and other datasets to the SRDS SQL server
2. Web submission of relational tables by the data producers/custodians
3. Automatic translation and transfer of NARSTO-archived DES data to SQL
[Figure: data flows from the NARSTO/ORNL DES archive through the DES-SQL transformer, from direct web data input, and from Supersite and other SQL data through a Data Adapter, into the coordinated Supersite relational SQL server; queries return table output]

Summary of the Proposed Database Schema Design
• The starting point for the design of the Supersite Relational Database schema will be the Minimal Star Schema for fixed-location monitoring data.
• Extensions will be made if they clearly benefit regional analyses and cross-Supersite comparisons.
• The possible extensions, based on user needs, may include the addition of:
– a 'Method' dimension table to identify the sampling/analysis method of each observation
– additional attributes (columns) in the Site and Parameter tables
• The Supersite data are not yet ready for submission to the NARSTO archive. Thus, there is still time to develop an agreed-upon schema for the Supersite data in SRDS.
• The schema modifications and the consensus-building will be conducted through the SRDS website.

The RDMS Schema: 'Minimal' Star Schema
For integrative, cross-Supersite analysis, the database has to have, at a minimum, a 'fact table' and associated time, location, parameter and method tables as dimensions.
• The minimal Site table includes SiteID, Name and Lat/Lon.
• The minimal Parameter table consists of ParameterID, Description and Unit
• The time dimension table is usually skipped, since the time code is self-describing
• The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes Obs_DateTime, Site_ID and Parameter_ID
• Additional dimension tables may include Method and Data Quality.
The CAPITA data exploration software, Voyager, uses this minimal schema. Voyager has been in use for the past 12 years, successfully encoding and browsing 1000+ datasets worldwide. The State of California still formats and distributes its AQ data on CDs using Voyager.