CSE 6350 File and Storage System Infrastructure in Data
centers Supporting Internet-wide Services
Resilient Distributed Datasets
Presenter: Mounika Nimmagadda
Key words:
• Transformation
• Action
• Lineage
Motivation
• RDDs are motivated by two types of applications:
1) Iterative algorithms
2) Interactive data mining
Motivation
• Existing frameworks like MapReduce share data between
computations by writing it to stable storage, which adds overhead
from replication, disk I/O, and serialization
• So we need efficient methods for sharing data across
computations, such as keeping it in memory
Challenges
• The main challenge is providing fault tolerance efficiently
• Existing in-memory abstractions offer an interface based on
fine-grained updates to mutable state
• With that interface, fault tolerance requires replicating data or
logging updates across machines, which is expensive
RDD
• RDDs can only be built using coarse-grained transformations
like map, filter, and join
• Lost partitions can be recomputed after a failure by using lineage,
the record of the transformations that produced them
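The idea of recovering a lost partition from lineage can be sketched with a toy model in plain Scala. This is only an illustration under stated assumptions, not how Spark is implemented: the names ToyRDD, Source, Mapped, and Filtered are hypothetical, and the "partitions" are ordinary in-memory sequences.

```scala
// Toy model of lineage-based recovery. NOT Spark's implementation;
// ToyRDD, Source, Mapped and Filtered are hypothetical names.
object LineageSketch {
  sealed trait ToyRDD[A] {
    def numPartitions: Int
    // Rebuild one partition purely from the recorded lineage.
    def compute(partition: Int): Seq[A]
    // Coarse-grained transformations only record a new lineage node.
    def map[B](f: A => B): ToyRDD[B] = Mapped(this, f)
    def filter(p: A => Boolean): ToyRDD[A] = Filtered(this, p)
  }

  // Base dataset: stands in for data held in stable storage.
  final case class Source[A](partitions: Vector[Seq[A]]) extends ToyRDD[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  final case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: ToyRDD[A], p: A => Boolean) extends ToyRDD[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  }

  def main(args: Array[String]): Unit = {
    val base = Source(Vector(Seq(1, 2, 3), Seq(4, 5, 6)))
    val derived = base.map(_ * 10).filter(_ > 20)

    // Pretend partition 1 was lost: only that partition is recomputed,
    // by replaying map and filter over the matching base partition.
    val recovered = derived.compute(1)
    println(recovered)
  }
}
```

Only the transformations and a reference to the parent are stored in each node, so the "log" stays small no matter how many records the partitions hold.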
Ex: Console log mining
• Using Spark, an operator can load just the error messages
from the logs into memory across a set of nodes and query
them interactively.
• The following is the Scala code:
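// assumes spark is an existing SparkContext
val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
errors.persist()                                   // keep errors in memory for reuse
errors.filter(_.contains("MySQL")).count()         // action: triggers the computation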
In this example, lines is the base RDD, errors is a transformed RDD,
and count() is an action. Transformations are lazy, so no work is
performed until the action runs.
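The difference between a transformation and an action can be tried out without a cluster. The sketch below is a plain-Scala analogy, assuming that a collection view is an acceptable stand-in for a lazy RDD; LazyLogMining and the sample log lines are made up for illustration.

```scala
// Plain-Scala analogy for lazy transformations; no Spark required.
// The object name and the sample log lines are illustrative assumptions.
object LazyLogMining {
  // Returns (lines inspected before the action, number of MySQL errors,
  // lines inspected after the action).
  def run(log: Seq[String]): (Int, Int, Int) = {
    var inspected = 0
    // "Transformation": the view only records the filter; nothing is scanned yet.
    val errors = log.view.filter { line =>
      inspected += 1
      line.startsWith("ERROR")
    }
    val before = inspected
    // "Action": counting forces the filter to run over the data.
    val mysqlErrors = errors.count(_.contains("MySQL"))
    (before, mysqlErrors, inspected)
  }

  def main(args: Array[String]): Unit = {
    val log = Seq(
      "INFO  service started",
      "ERROR MySQL connection refused",
      "ERROR disk full",
      "ERROR MySQL timeout"
    )
    println(run(log))
  }
}
```

The first element of the result stays at zero because defining the filter does no work; the scan happens only when count is called.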
Spark runtime: (figure not reproduced in this transcript)
Ex: Logistic Regression (slide content not reproduced in this transcript)
Ex: Page Rank
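• The initial rank of each page is one. On each iteration, a page
with rank r and n outgoing links sends a contribution of r/n to each
of its neighbors, and every page then updates its rank to
α/N + (1 − α) ∑ c_i
where the c_i are the contributions the page received and N is the
total number of pages.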
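The update rule can be sketched on ordinary in-memory Scala collections. This is not the Spark program from the slides; the object name, the example graph, the value of alpha, and the iteration count are all illustrative assumptions, and the sketch assumes every page appears as a key in the links map.

```scala
// Sketch of the PageRank update on in-memory Scala collections (not Spark).
// Assumes every page, including link targets, is a key of `links`.
object PageRankSketch {
  def pageRank(
      links: Map[String, Seq[String]],
      alpha: Double,
      iterations: Int
  ): Map[String, Double] = {
    val n = links.size.toDouble
    // Every page starts with rank 1.
    var ranks: Map[String, Double] = links.keys.map(page => page -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each page sends rank / (number of out-links) to every neighbor.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
        case (page, outs) => outs.map(dest => dest -> (ranks(page) / outs.size))
      }
      // Sum the contributions received by each page.
      val received: Map[String, Double] =
        contribs.groupBy(_._1).map { case (page, cs) => page -> cs.map(_._2).sum }
      // New rank: alpha / N + (1 - alpha) * (sum of contributions).
      ranks = links.keys.map { page =>
        page -> (alpha / n + (1 - alpha) * received.getOrElse(page, 0.0))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
    pageRank(links, alpha = 0.15, iterations = 10)
      .toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a Spark version, the links dataset would be the natural candidate for persist, since it is reused unchanged on every iteration while ranks is rebuilt each time.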
The slide then shows the Spark code for PageRank and the lineage
graph that the program creates (neither is reproduced in this
transcript).
Questions
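1) “…individual RDDs are immutable…” What does “immutable”
mean? What benefits does this property of RDDs bring?
Immutable means that an RDD cannot be modified after it is created;
a transformation produces a new RDD instead. Because the data never
changes, a lost partition can be recomputed from its lineage, which
provides fault tolerance at low cost.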
2) When an RDD is being created (new data are being written
into it), can the data in the RDD be read for computing before
the RDD is completely created?
No. It is not possible, because the dataset is not yet completely
formed and cannot be used for any operations.
3) “This allows them to efficiently provide fault tolerance by logging the
transformations used to build a dataset (its lineage) rather than the
actual data.” “To achieve fault tolerance efficiently, RDDs provide a
restricted form of shared memory, based on coarse-grained
transformations rather than fine-grained updates to shared state.” Why
does using RDDs help to provide efficient fault tolerance? Or why does
coarse-grained transformation help with the efficiency?
A coarse-grained transformation applies the same operation to many data
records, so a dataset can be described by its input data plus a short
list of transformations. Logging that lineage is much cheaper than
logging or replicating the data itself, and after a failure only the
lost partitions need to be recomputed.
4) “In addition, programmers can call a persist method to
indicate which RDDs they want to reuse in future operations.”
What’s the consequence if a user does not explicitly request
persistence of an RDD?
The RDD can be removed from memory after it is used, so a later
operation that needs it must recompute it from its lineage.
5) Explain Figure 1 about a lineage graph.
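The answer to question 4 can be illustrated with another plain-Scala analogy. It assumes that a collection view, which is re-evaluated on every traversal, behaves like an RDD that was not persisted, and that materializing the view with toVector plays the role of persist. PersistSketch and the sample log lines are hypothetical.

```scala
// Plain-Scala analogy (no Spark): a view is re-evaluated on each traversal,
// much like an unpersisted RDD; toVector stands in for persist().
object PersistSketch {
  // Runs two "actions" and returns
  // (times the filter ran, MySQL error count, HDFS error count).
  def run(log: Seq[String], persist: Boolean): (Int, Int, Int) = {
    var calls = 0
    val errorsView = log.view.filter { line =>
      calls += 1
      line.startsWith("ERROR")
    }
    // With persist, materialize once; otherwise keep the lazy, uncached view.
    val errors: Iterable[String] = if (persist) errorsView.toVector else errorsView
    val mysql = errors.count(_.contains("MySQL")) // first action
    val hdfs = errors.count(_.contains("HDFS"))   // second action
    (calls, mysql, hdfs)
  }

  def main(args: Array[String]): Unit = {
    val log = Seq("ERROR MySQL down", "INFO ok", "ERROR HDFS slow")
    println(run(log, persist = false))
    println(run(log, persist = true))
  }
}
```

Both runs return the same counts, but without the materialization step the filter is applied to the whole log once per action rather than once in total.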
Thank you