CSE 6350 File and Storage System Infrastructure in Data Centers Supporting Internet-wide Services

Resilient Distributed Datasets

Presenter: Mounika Nimmagadda

Key words
• Transformation
• Action
• Lineage

Motivation
• RDDs are motivated by two types of applications:
  1) Iterative algorithms
  2) Interactive data mining

Motivation
• Existing frameworks such as MapReduce share data between computations by writing it to stable storage, which incurs overhead from data replication, disk I/O, and serialization.
• So we need efficient methods for sharing data across computations.

Challenges
• The main challenge is providing fault tolerance efficiently.
• Existing abstractions offer an interface based on fine-grained updates to mutable state.
• With such an interface, fault tolerance is provided by replicating data or logging updates across machines, which is expensive.

RDD
• RDDs can only be built using coarse-grained transformations such as map, filter, and join.
• Lost partitions can be recovered after a failure by using lineage: the recorded transformations are re-applied to recompute them.

Ex: Console log mining
• Using Spark, the operator can load just the error messages from the logs into memory across a set of nodes and query them interactively.
• The following is the Scala code:

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    errors.persist()
    errors.filter(_.contains("MySQL")).count()

• This example has a base RDD (lines), transformed RDDs (errors and the MySQL filter on it), and an action (count).

Spark runtime: Logistic Regression

Ex: PageRank
• Each page starts with a rank of one. On each iteration, a page's rank is updated to α/N + (1 − α) Σ c_i, where the c_i are the contributions the page receives from pages that link to it and N is the total number of pages.
• [Figure: Spark code for PageRank]
• [Figure: lineage graph created by the PageRank program]

Questions

1) “…individual RDDs are immutable…” What does it mean by being “immutable”? What benefits does this property of RDD bring?
Immutable means not modifiable: once an RDD is created, its contents never change. This provides fault tolerance at low cost, because a lost partition can always be recomputed from its lineage and will come out the same.

2) When an RDD is being created (new data are being written into it), can the data in the RDD be read for computing before the RDD is completely created?
No. It is not possible, because the dataset is not completely formed and cannot be used for any operations.

3) “This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.” “To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.” Why does using RDD help to provide efficient fault tolerance? Or why does coarse-grained transformation help with the efficiency?
A dataset is defined by its input data and a set of transformations, each of which applies to many data records. Only these few transformations need to be logged, not the data itself, and lost partitions can be recomputed from them. This gives efficient fault recovery.

4) “In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations.” What’s the consequence if a user does not explicitly request persistence of an RDD?
The RDD can be removed from memory after it is used, so a later operation that needs it must recompute it from its lineage.

5) Explain Figure 1 about a lineage graph.

Thank you
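Sketch: lineage and persist in plain Scala

The console log mining example and the answers on lineage and persistence can be illustrated with a small toy model. This is a minimal sketch in plain Scala with no Spark dependency. The names MiniRDD, Source, Filtered, Persisted, losePartition, and recomputations, as well as the sample log lines, are hypothetical and invented for illustration; they are not Spark's API, and the real system is considerably more involved. The sketch assumes an RDD can be modeled as a list of partitions, each recomputable from its parent.

```scala
import scala.collection.mutable

// Toy model only: hypothetical classes, not Spark's RDD implementation.
object LineageSketch {
  trait MiniRDD[A] {
    def numPartitions: Int
    // Derive one partition by replaying the lineage from the parent.
    def compute(partition: Int): Seq[A]
    // Coarse-grained transformation: records the operation, moves no data.
    def filter(p: A => Boolean): MiniRDD[A] = new Filtered(this, p)
    def persist(): Persisted[A] = new Persisted(this)
    // Action: forces computation of every partition.
    def count(): Long =
      (0 until numPartitions).map(i => compute(i).size.toLong).sum
  }

  // Base RDD: stands in for data held in stable storage.
  final class Source[A](partitions: Vector[Seq[A]]) extends MiniRDD[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  // Transformed RDD: remembers only its parent and the function applied.
  final class Filtered[A](parent: MiniRDD[A], p: A => Boolean) extends MiniRDD[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  }

  // Persisted RDD: keeps computed partitions in memory for reuse.
  final class Persisted[A](parent: MiniRDD[A]) extends MiniRDD[A] {
    private val cache = mutable.Map.empty[Int, Seq[A]]
    var recomputations = 0 // how often a partition was rebuilt from lineage
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[A] =
      cache.getOrElseUpdate(partition, {
        recomputations += 1
        parent.compute(partition)
      })
    // Simulate a node failure that loses one in-memory partition.
    def losePartition(partition: Int): Unit = {
      cache.remove(partition)
      ()
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up log lines, split into two partitions.
    val lines = new Source(Vector(
      Seq("ERROR MySQL down", "INFO ok"),
      Seq("ERROR disk full", "ERROR MySQL timeout")))
    val errors = lines.filter(_.startsWith("ERROR")).persist()
    println(errors.filter(_.contains("MySQL")).count())
    errors.losePartition(1) // only this partition is rebuilt on the next query
    println(errors.filter(_.contains("MySQL")).count())
    println(errors.recomputations)
  }
}
```

In this model, losing a partition costs one recomputation of that partition from its parent, not a restore from a replica. Without the persist() call, every query would replay the whole chain from the source, which mirrors the answer to question 4.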
