Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems
Ian Gorton and John Klein
Presenter: Weicong Ma

Agenda
• Background
• The Challenges for Big Data
• Big Data Application Characteristics
• Example: Clinical Application
• Systematic Design Using Tactics
• Discussions

Background
• Data is growing exponentially
• Data-intensive (big data) software systems
• Open source and commercial data management technologies are being developed
Requires:
• Design tradeoffs spanning the distributed software, data, and deployment architectures
• Extending traditional software architecture design knowledge to account for the tight coupling that exists in scalable systems

Background
“…Distribution, data and deployment architectural qualities can no longer be effectively considered separately…”
• Distribution: the level of the design that deals with the high-level organization of computational elements and the interactions between those elements
• Data: a set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated
• Deployment architecture: depicts the mapping of a logical architecture to a physical environment

The Challenges for Big Data
Traditionally: SQL database technology
• Vertical scaling
  – faster processors and bigger disks as workloads or storage requirements increase
• Strictly defined, normalized data models
• Strong data consistency guarantees
• SQL standards
Recently: NoSQL products emerge
• Horizontal scaling
  – across clusters of low-cost, moderate-performance servers
  – partitioning and replicating datasets across a cluster
• Schemaless and intentionally denormalized data models
• Weak consistency
• Proprietary APIs to expose data management mechanisms

The Challenges for Big Data: NoSQL
• Each technology supports its own query mechanism
• Programmers are responsible for formulating query executions
• Programmers are responsible for combining results from different data
collections
• Programmers must manage consistency when concurrent updates occur and design applications to tolerate stale data caused by latency in update replication

The Challenges for Big Data: Tradeoffs
Fundamental quality constraints of distributed databases: the CAP theorem (Eric Brewer)
• Consistency: each server returns the right response to each request
• Availability: each request eventually receives a response
• Partition tolerance: guaranteed properties are maintained even when network failures prevent some machines from communicating with others

The Challenges for Big Data: Tradeoffs
Practical interpretation of the CAP theorem: PACELC (Daniel Abadi)
PACELC: Partition, Availability, Consistency, Else, Latency, Consistency
If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?

The Challenges for Big Data: Others
• Achieving high scalability and availability leads to highly distributed systems
• The abstraction of a single system image, with transactional writes and consistent reads using SQL-like query languages, is difficult to achieve at scale
• Each NoSQL product embodies a specific set of quality attribute tradeoffs
  – Polyglot persistence: using different technologies to store different datasets in a single system
• Required hardware resources grow as data volumes grow.
Many widely used software architecture patterns are unsuitable.

The Challenges for Big Data: Scale
• Scale changes our designs’ problem space
  Problems: partial failures, communication latencies, concurrency, consistency, and replication
  Scalable applications must treat failures as common events
  – Replicate data
  – Architecture components are stateless, replicated, and tolerant of failures of dependent services
• Economics-based implications
  “At a very large scale, small optimizations in resource use can lead to very large cost reductions…”
• Testing and fault diagnosis
  – Comprehensively validating code before deployment is impossible
  – Testing at scale (advanced monitoring and logging)

Big Data Application Characteristics
A big data system must be able to:
1. Sustain write-heavy workloads
2. Deal with variable request loads
3. Support computation-intensive analytics
4. Have high availability
These requirements crosscut the distribution, data, and deployment architectures.

Big Data Application Characteristics
These requirements crosscut the distribution, data, and deployment architectures.
Example: elasticity requires…
• Processing capacity that can be acquired from the execution platform on demand
• Policies and mechanisms to appropriately start and stop services as the application load varies
• A database architecture that can reliably satisfy queries under an increased load

Example: Healthcare
Patient demographics (name, insurance provider, …)
• Immediately visible at the local site where the data was modified
• Delay acceptable at other sites – eventual replica consistency
Diagnostic test results (blood, imaging test results, …)
• Immediately visible everywhere – strong replica consistency

Example: Healthcare
MongoDB prototype solution
Patient demographics (name, insurance provider, …)
• Writes durable on the primary replica
• Reads can be directed to the closest replica for low latency
Diagnostic test results (blood, imaging test results, …)
• Writes durable on all
replicas
• Reads are insensitive to partitions

Example: Healthcare
“Scale drives a consolidation of concerns so that the distribution, data and deployment architectural qualities can no longer be effectively considered separately…”
Today, healthcare informatics applications:
• Run atop SQL databases
• Hide the physical data model and deployment topology from developers
• Separate concerns between the application and the database
Shift to NoSQL:
• Fault handling depends on the physical data distribution
• Low-level infrastructure concerns must now be explicitly handled in application logic
“Low-level infrastructure concerns, traditionally hidden under the database interface, must be explicitly handled in big data systems.”

Systematic Design Using Tactics
Tactics: elemental design decisions, embodying architectural knowledge of how to satisfy one design concern of a quality attribute.
Designing an architecture: systematically select and apply a sequence of architecture tactics.
Tactics catalogs enable reuse of architectural knowledge, but existing catalogs don’t contain tactics specific to big data systems.

References
[1] Gorton, I., & Klein, J. (2015). Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems. IEEE Software, 32(3), 78-85. doi:10.1109/ms.2014.51
[2] Magee, J., Dulay, N., Eisenbach, S., & Kramer, J. (1995). Specifying Distributed Software Architectures. Software Engineering - ESEC '95, Lecture Notes in Computer Science, 137-153.
[3] Chapter 5: Designing a Deployment Architecture. (2004). Retrieved October 30, 2016, from https://docs.oracle.com/cd/E19199-01/817-5759/dep_architect.html
[4] What is Data Architecture? - Definition from Techopedia. (n.d.). Retrieved October 30, 2016, from https://www.techopedia.com/definition/6730/data-architecture
[5] NoSQL Databases: An Overview. (2014).
Retrieved October 13, 2016, from https://www.thoughtworks.com/insights/blog/nosql-databasesoverview
[6] Abadi, D. (2012). Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story. Computer, 45(2), 37-42. doi:10.1109/mc.2012.33

Image References
[1] https://www.getfilecloud.com/blog/2014/08/leading-nosqldatabases-to-consider/#.WBP_a5MrL-Y
[2] https://en.wikipedia.org/wiki/HipHop_for_PHP
[3] http://blog.soprasteria.com/aeronautics-big-data-helps-acceleratetest-flights/
[4] http://www.edureka.co/blog/big-data-applications-in-healthcare/
[5] https://cs.uwaterloo.ca/~kmsalem/courses/cs743/F14/slides/ShuZhang.pdf

Discussions

Strengths and Weaknesses
Strengths:
• Goals and motivations are clearly identified
• Gives a comprehensive overview of the challenges, concerns, and solutions in designing software architecture for big data problems
• Uses the concrete example of a healthcare system to explain and demonstrate hard-to-understand concepts
Weaknesses:
• Too much background information
• Lacks in-depth detail, e.g., how are the tradeoffs handled in real-world problems?
• Shows only a limited set of architecture tactics, which work for limited problem domains
• Future work lacks in-depth discussion

Related Papers
[1] Abadi, D. (2012). Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story. Computer, 45(2), 37-42. doi:10.1109/mc.2012.33
[2] Klein, J., & Gorton, I. (2015). Design Assistant for NoSQL Technology Selection. Proceedings of the 1st International Workshop on Future of Software Architecture Design Assistants - FoSADA '15. doi:10.1145/2751491.2751494
[3] Kruchten, P. (1995). The 4+1 View Model of Architecture. IEEE Software, 12(6), 42-50.
doi:10.1109/52.469759

Future Work
Two complementary directions:
• Expand the collection of architecture tactics and encode them in an environment that supports navigation between quality attributes and tactics, making crosscutting concerns for design choices explicit
• Link tactics to design solutions based on specific big data technologies, enabling architects to rapidly relate a particular technology’s capabilities to a specific set of tactics

Questions
• Are any SQL databases trying to solve big data problems the way the NoSQL counterparts mentioned in this paper do?
• How can we expand our collection of architecture tactics? Any ideas?
• Is it easy to link tactics to design solutions based on specific big data technologies?
• Will it be easier to design a real-world software system for a big data application if we use these tactics? Is it easy to make the tradeoffs?

Thanks! Any questions?
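
Backup: sketch of the healthcare consistency split
The two consistency levels in the healthcare example (eventual for patient demographics, strong for diagnostic test results) can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: it is not the MongoDB API and not the authors' prototype. The names `Replica`, `ReplicaSet`, the `ack` parameter, and the site names are all hypothetical, and replication lag is simulated by an explicit `replicate()` call.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One site's copy of the data (hypothetical in-memory stand-in)."""
    name: str
    data: dict = field(default_factory=dict)


class ReplicaSet:
    """Toy replica set; the first replica listed acts as the primary."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = []  # updates not yet applied to the secondaries

    def write(self, key, value, ack="primary"):
        # The primary always applies the write before acknowledging it.
        self.replicas[0].data[key] = value
        if ack == "all":
            # Strong replica consistency: apply everywhere before returning.
            for replica in self.replicas[1:]:
                replica.data[key] = value
            # Discard older queued updates so they cannot overwrite this value.
            self.pending = [(k, v) for k, v in self.pending if k != key]
        else:
            # Eventual replica consistency: secondaries catch up later.
            self.pending.append((key, value))

    def replicate(self):
        """Apply queued updates to the secondaries (replication lag ends)."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()

    def read(self, key, site):
        """Read from the replica at the caller's own site (closest replica)."""
        replica = next(r for r in self.replicas if r.name == site)
        return replica.data.get(key)


rs = ReplicaSet(["site_a", "site_b"])
rs.write("demographics:42", {"insurer": "Acme"}, ack="primary")
rs.write("test_result:42", {"blood": "normal"}, ack="all")

print(rs.read("demographics:42", "site_a"))  # visible at the local site
print(rs.read("demographics:42", "site_b"))  # None: stale until replication
print(rs.read("test_result:42", "site_b"))   # visible everywhere at once
rs.replicate()
print(rs.read("demographics:42", "site_b"))  # now visible at the remote site
```

The point of the sketch is the tradeoff named on the PACELC slide: the `ack="all"` path pays latency on every write to get consistent reads at any site, while the `ack="primary"` path returns quickly and leaves the application to tolerate stale reads elsewhere.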