Download Data Mining Engineering

Parallel and Distributed Database Systems (Parallele und Verteilte Datenbanksysteme)
Univ.-Prof. Dr. Peter Brezany
Institut für Scientific Computing – Universität Wien
Tel. 4277 39425
Office hours: Tue, 13:00-14:00
Course portal: www.par.univie.ac.at/~brezany/teach/gckfk/300658.html

Motivation
[Figure: a cloud of data and data exploration, fed by business, medicine, scientific experiments, simulations, and Earth observations]

The Knowledge Discovery Process
[Figure: stages labelled Cleaning and Integration, Selection and Transformation, Data Warehouse, Data Mining, OLAP Queries (OLAP), Online Analytical Mining, and Knowledge Evaluation and Presentation]

Data Preprocessing
[Fig. 3.1]

EcoGRID Sketch
[Figure: distributed data (biodiversity, waste, air, soil, emissions, water, forests, …) and distributed applications (reporting, statistics, distributed data mining, popular presentation, flow analysis, prediction models, geostatistics), linked by a common ontology]

Management of TBI Patients
• Traumatic brain injuries (TBIs) typically result from accidents in which the head strikes an object.
• The treatment of TBI patients is very resource intensive.
• The trajectory of TBI patient management:
– Trauma event
– First aid
– Transportation to hospital
– Acute hospital care
– Home care
• Usage of mobile communication devices
• All of the above phases are associated with data collection into databases, which are currently managed by individual hospitals.

Data Mining Accuracy vs. Data Size
[Figure: accuracy (up to 100%) plotted against data size, marking the sampled data size and the available data size]

The GridMiner Project in Vienna
• GridMiner: a knowledge discovery Grid infrastructure (http://www.gridminer.org/)
– OGSA-based architecture
– Workflow management
– Grid-aware data preprocessing and data mining services
– Data mediation service
– OLAP service
– GUI
– Current implementation on top of Globus Toolkit 3.2
• Applications: exploration of ecological data, management of patients with traumatic brain injuries
• Research exhibition available

Literature
On the course web page.

Distributed Memory Architecture (Shared Nothing)
[Figure: four CPUs, each with its own local memory, connected by an interconnection network]

DMM: Shared Disk Architecture
[Figure: four CPUs, each with its own local memory, connected by an interconnection network and sharing a global shared disk subsystem]

Shared Memory Architecture (Shared Everything, SMP)
[Figure: four CPUs connected by an interconnection network to a global shared memory]

Cluster of SMPs
[Figure: four 4-CPU SMPs connected by an interconnection network]

High-Performance I/O Systems
Note: RAID technology is introduced in separate lecture notes.
Principles of Distributed Database Systems
The main literature

Distributed Database System (DDBS) Technology – Introduction
• DDBS technology is the union of what appear to be two diametrically opposed approaches to data processing: database systems and computer network technologies.
• Database systems have taken us from a paradigm of data processing in which each application defined and maintained its own data (figure follows) to one in which the data is defined and administered centrally (figure follows). The result is data independence: the application programs are immune to changes in the logical or physical organization of the data, and vice versa.
• One of the major motivations is the desire to integrate the operational data of an enterprise and to provide centralized, and thus controlled, access to that data.

DDBS – Introduction (cont.)
• The technology of computer networks promotes a mode of work that goes against all centralization efforts.
• How can these two contrasting approaches be synthesized to produce a technology that is more powerful and more promising than either one alone?
– The key is the realization that the most important objective of database technology is integration, not centralization. Neither of these terms necessarily implies the other.
– It is possible to achieve integration without centralization, and that is exactly what distributed database technology attempts to achieve.
Distributed Database System Technology - Introduction
[Figures]

Central Database on a Network: Example
[Figure: sites in Boston, Edmonton, Paris, and San Francisco connected by a communication network]

Distributed Database System (DDBS) - Definitions
• Definition 1: Distributed database. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
• Definition 2: Distributed database management system (distributed DBMS). It is defined as the software system that permits the management of the distributed database and makes the distribution transparent to the users.
• A DDBS is not a "collection of files" that can be individually stored at each node of a computer network. To form a DDBS, the files should not only be logically related; there should also be structure among the files, and access should be via a common interface.
• The physical distribution of data is very important. It creates problems that are not encountered when the databases reside in the same computer system.

Promises of DDBSs
1. Transparent Management of Distributed and Replicated Data
• Transparency refers to the separation of the higher-level semantics of a system from lower-level implementation issues; a transparent system "hides" the implementation details from the user.
• Example (next slide): Consider an engineering firm that has offices in several cities.
– It is preferable to localize the data, so that data about the employees in the Edmonton office are stored in Edmonton, and so forth. The same applies to the project information. In this process we partition each of the relations and store each partition at a different site; this is known as fragmentation.
– It may be preferable to duplicate some of this data at other sites for performance and reliability reasons. The result is a distributed database that is fragmented and replicated. Fully transparent access means that the users can still pose queries in the same form as to a centralized system, without paying any attention to the fragmentation, location, or replication of data, and let the system worry about resolving these issues.

Distributed Database System Environment - Example
[Figure: four sites connected by a communication network, each storing the following data]
• Edmonton: Edmonton employees, Paris projects, Edmonton projects
• Boston: Boston employees, Paris employees, Boston projects
• Paris: Paris employees, Boston employees, Paris projects, Boston projects
• San Francisco: San Francisco employees, San Francisco projects

Promises of DDBSs
2. Reliability Through Distributed Transactions
• Distributed DBMSs are intended to improve reliability, since they have replicated components and thereby eliminate single points of failure.
• The failure of a single site, or of a communication link that makes one or more sites unreachable, is not sufficient to bring down the entire system. In the case of a distributed database, this means that some of the data may be unreachable, but with proper care, users may be permitted to access other parts of the distributed database.
• The "proper care" comes in the form of support for distributed transactions.

Promises of DDBSs
3. Improved Performance
1. A distributed DBMS fragments the conceptual database, enabling data to be stored in close proximity to its points of use.
2. The inherent parallelism of distributed systems may be exploited for inter-query and intra-query parallelism.
• Inter-query parallelism results from the ability to execute multiple queries at the same time.
• Intra-query parallelism is achieved by breaking up a single query into a number of subqueries, each of which is executed at a different site, accessing a different part of the distributed database.

Promises of DDBSs
4. Easier System Expansion
• In a distributed environment, it is much easier to accommodate increasing database sizes.
• Major system overhauls are seldom necessary; expansion can usually be handled by adding processing and storage power to the network.
• It may not be possible to obtain a linear increase in "power", since this also depends on the overhead of distribution.
• It normally costs much less to put together a system of smaller computers with the equivalent power of a single big machine.

Problem Areas
• Distributed database design
• Distributed query processing
• Distributed directory management
• Distributed concurrency control
• Distributed deadlock management
• Heterogeneous databases

Distributed DBMS Architecture
• The architecture of a system defines its structure.
• This means that the components of the system are identified, the function of each component is specified, and the interrelationships and interactions among these components are defined.
• In this part we classify DBMS architectures. These are idealized views; many research and commercially available systems may deviate from them.
• We use a classification (next slides) that organizes the systems with respect to (1) the autonomy of local systems, (2) their distribution, and (3) their heterogeneity.
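The fragmentation and intra-query parallelism described under "Promises of DDBSs" can be illustrated with a minimal Python sketch. It is not taken from the lecture: the `FRAGMENTS` dictionary, the row contents, and the function names `run_subquery` and `global_query` are hypothetical, and threads merely stand in for sites that a real distributed DBMS would reach over the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical horizontal fragments of an EMPLOYEE relation, one per site.
# Site names follow the slides' example; the rows are invented for illustration.
FRAGMENTS = {
    "Edmonton": [("E1", "Edmonton", 52000), ("E2", "Edmonton", 61000)],
    "Boston": [("E3", "Boston", 58000)],
    "Paris": [("E4", "Paris", 49000), ("E5", "Paris", 75000)],
    "San Francisco": [("E6", "San Francisco", 83000)],
}


def run_subquery(site, min_salary):
    """Subquery evaluated against one site's fragment only (a local selection)."""
    return [row for row in FRAGMENTS[site] if row[2] >= min_salary]


def global_query(min_salary):
    """Answer one global query; the caller never sees the fragmentation."""
    sites = list(FRAGMENTS)
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        # Intra-query parallelism: one subquery per fragment, run concurrently.
        partial_results = pool.map(lambda s: run_subquery(s, min_salary), sites)
        # Merge step: the union of the partial results answers the global query.
        merged = [row for part in partial_results for row in part]
    return sorted(merged)


if __name__ == "__main__":
    for row in global_query(60000):
        print(row)
```

The caller poses `global_query(60000)` exactly as it would against a single centralized relation. Which fragments exist and where they live is resolved inside the function, which is the point of distribution transparency.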
Autonomy
• Autonomy refers to the distribution of control, not of data. It indicates the degree to which individual DBMSs can operate independently.
• Requirements of an autonomous system:
– The local operations of the individual DBMSs are not affected by their participation in a multidatabase system.
– The manner in which the individual DBMSs process queries and optimize them should not be affected by the execution of global queries that access multiple databases.
– System consistency or operation should not be compromised when individual DBMSs join or leave the multidatabase confederation.

Distribution
• Whereas autonomy refers to the distribution of control, the distribution dimension of the taxonomy deals with data.
• There are a number of ways DBMSs have been distributed. We abstract two alternative classes:
– client/server distribution
– peer-to-peer distribution (or full distribution)

Heterogeneity
• Heterogeneity may occur in different forms:
– hardware
– data models
– query languages
– transaction management protocols

Architecture Model
[Figure]

Architecture of DBMSs
• Client/server architecture (not of interest for this course)
• Distributed database architecture
• Multidatabase architecture

Client/Server Architecture
Here there is typically one central database server and a larger number of networked workstations that store no relevant data. The user at a workstation sees the full functionality of the DBMS. The system behaves like a centralized database system; the communication is transparent to the user.
Client/Server Architecture (cont.)
[Figure]

Distributed Database System
• Here there are several database servers; particular data may be stored on only one machine or, replicated, on several.
• A virtual database whose components are physically mapped onto a number of different, actually existing DBMSs.
• In this case, transactions can span several DBMSs.
• A collection of data that
– belongs to the same system because of common, linking properties
– is distributed over different computers in the network
– where each computer has its own database
– and can handle tasks locally and autonomously

Distributed Database System (cont.)
– simultaneous use of the computing power of several computers
– the data-access bottleneck of centralized database systems is avoided, because the data are distributed (and possibly replicated)
– the data are managed by one database system
– distribution transparency
– basis: the 4-level schema architecture

Repetition: ANSI/SPARC Architecture
[Figure: users see external views (external schema); below them lie the conceptual view (conceptual schema) and the internal view (internal schema)]

The external view is concerned with how users view the database. An individual user's view represents the portion of the database that will be accessed by that user as well as the relationships that the user would like to see among the data. A view can be shared among a number of users.

The conceptual schema is an abstract definition of the database; it is the "real view" of the enterprise being modeled in the database. The requirements of individual applications or the restrictions of the physical storage media are not considered.
The internal view deals with the physical definition and organization of data. The location of data on different storage devices and the access mechanisms used to reach and manipulate data are the issues dealt with at this level.

Distributed Database System (cont.)
[Figure: the 4-level schema architecture]
• external schema 1 ... external schema N
• global conceptual schema
• local conceptual schema ... local conceptual schema (one per site)
• local internal schema ... local internal schema (one per site)

Functional Schematic of an Integrated Distributed DBMS
The global directory/dictionary (GD/D) permits the required global mappings. Local mappings are performed by a local directory/dictionary (LD/D).

Components of a Distributed DBMS
User processor
1. The user interface handler is responsible for interpreting user commands and formatting the result data.
2. The semantic data controller uses the integrity constraints and authorizations that are defined as part of the global conceptual schema to check whether the user query can be processed.
3. The global query optimizer and decomposer determines an execution strategy to minimize a cost function, and translates the global queries into local ones using the global and local conceptual schemas as well as the global directory.
4. The distributed execution monitor coordinates the distributed execution of the user request.
Data processor
1. The local query optimizer is responsible for choosing the best access path to access any data item. (The term access path refers to the data structures and algorithms that are used to access data. A typical access path is an index on one or more attributes of a relation.)
2. The local recovery manager is responsible for making sure that the local database remains consistent.
3. The run-time support processor physically accesses the database according to the physical commands in the schedule generated by the query optimizer.

Multidatabase System
– An MDBS is a federation of several database systems.
– The conceptual schema represents only the part of the data that the local DBMSs want to share.
– Local applications can access each DBS.
– Each DBS may contain data that have no relationship to the data of other DBSs.

Multidatabase System
[Figure: model with a global conceptual schema. Global external schemas (GES) are defined over the global conceptual schema (GKS); local external schemas (LES) are defined over the local conceptual schemas (LKS 1 ... LKS n), which map to the local internal schemas (LIS 1 ... LIS n)]

Multidatabase System (cont.)
[Figure: model without a global conceptual schema. The multidatabase layer holds the external schemas (ES 1, ES 2, ..., ES n); the local system layer holds the local conceptual schemas (LKS 1, LKS 2, LKS 3) and the local internal schemas (LIS 1, LIS 2, LIS 3)]

Components of an MDBS
[Figure]

Directory Management Strategies - Alternatives
[Figure]
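The role of the global directory (GD/D) in translating a global query into local ones can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the dictionary layout, the fragment names, and the function `decompose` are hypothetical, and the replica placement simply mirrors the Edmonton/Boston/Paris/San Francisco example from the earlier slides. Picking the first reachable copy stands in for the cost-based choice a real global query optimizer would make.

```python
# Hypothetical global directory (GD/D): fragment name -> sites holding a copy.
# The placement mirrors the slides' example figure; the dict layout is an assumption.
GLOBAL_DIRECTORY = {
    "employees_edmonton": ["Edmonton"],
    "employees_boston": ["Boston", "Paris"],
    "employees_paris": ["Paris", "Boston"],
    "employees_san_francisco": ["San Francisco"],
}

# Hypothetical global mapping: global relation -> the fragments it consists of.
GLOBAL_SCHEMA = {
    "employees": [
        "employees_edmonton",
        "employees_boston",
        "employees_paris",
        "employees_san_francisco",
    ],
}


def decompose(relation, reachable_sites):
    """Translate a global relation reference into (fragment, site) subqueries.

    Returns the execution plan and the fragments that have no reachable copy.
    """
    plan, unreachable = [], []
    for fragment in GLOBAL_SCHEMA[relation]:
        # Pick the first reachable replica; a real optimizer would use a cost function.
        site = next(
            (s for s in GLOBAL_DIRECTORY[fragment] if s in reachable_sites), None
        )
        if site is None:
            unreachable.append(fragment)
        else:
            plan.append((fragment, site))
    return plan, unreachable


if __name__ == "__main__":
    # Boston is down: its employee fragment is still served by the replica in Paris.
    plan, missing = decompose("employees", {"Edmonton", "Paris", "San Francisco"})
    print(plan)
    print(missing)
```

With Boston unreachable, the plan falls back to the Paris replica and no fragment is reported missing. With Edmonton unreachable, the unreplicated Edmonton fragment is reported as unavailable while the rest of the database stays accessible, which matches the reliability promise stated earlier.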
