Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Presentation by Zbigniew Chlebicki, based on the paper by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica; University of California, Berkeley. Some images and code samples are from the paper, the NSDI presentation, or the Spark project website ( http://spark-project.org/ ).

MapReduce in Hadoop
[diagram: Hadoop MapReduce data flow]

Resilient Distributed Datasets (RDDs)
● Immutable, partitioned collections of records
● Created by deterministic, coarse-grained transformations
● Materialized on action
● Fault-tolerant through lineage
● Controllable persistence and partitioning

Example: Log Mining

val file = spark.textFile("hdfs://…")
val errors = file.filter(line => line.contains("ERROR")).cache()

// Count all the errors
errors.count()

// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()

// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()

Example: Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)

Example: PageRank

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}

Representation (common RDD interface)

abstract def splits: Array[Split]                     // set of partitions
abstract def compute(split: Split): Iterator[T]       // compute one partition
abstract val dependencies: List[spark.Dependency[_]]  // lineage: parent RDDs
val partitioner: Option[Partitioner]                  // partitioning scheme, if any
def preferredLocations(split: Split): Seq[String]     // data-locality hints

Scheduling
[diagram: job stages built from the lineage graph]

Evaluation: PageRank Scalability
[figure]

Fault Recovery (k-means)
[figure]

Behavior with Insufficient RAM (logistic regression)
[figure]
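To make the lazy-materialization and lineage ideas above concrete, here is a minimal single-machine sketch in plain Scala. It is not Spark code: MiniRDD, SourceRDD, MappedRDD, and FilteredRDD are names invented for this sketch, and real RDDs are partitioned and distributed. The sketch only shows how transformations build a lineage chain that is re-evaluated on each action.

```scala
// Illustrative sketch only: a toy, single-machine analogue of the RDD
// interface, showing lazy transformations and lineage-based recomputation.
abstract class MiniRDD[T] {
  def compute(): Iterator[T]                 // recompute this dataset from lineage
  def map[U](f: T => U): MiniRDD[U] = new MappedRDD(this, f)
  def filter(p: T => Boolean): MiniRDD[T] = new FilteredRDD(this, p)
  def collect(): Seq[T] = compute().toSeq    // action: materializes the data
  def count(): Int = compute().size          // action
}

class SourceRDD[T](data: Seq[T]) extends MiniRDD[T] {
  def compute(): Iterator[T] = data.iterator
}

class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  // The reference to `parent` is the lineage: if this dataset is lost,
  // it can be rebuilt by recomputing the parent and re-applying `f`.
  def compute(): Iterator[U] = parent.compute().map(f)
}

class FilteredRDD[T](parent: MiniRDD[T], p: T => Boolean) extends MiniRDD[T] {
  def compute(): Iterator[T] = parent.compute().filter(p)
}

val lines = new SourceRDD(Seq("ERROR disk", "INFO ok", "ERROR net"))
val errors = lines.filter(_.startsWith("ERROR")) // nothing computed yet
errors.count()                                   // action triggers computation
```

Because compute() can always rebuild results from the source, losing a cached copy never loses data; this is the lineage-based fault tolerance of the paper, in miniature.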
User Applications
● Conviva: data mining (40x speedup)
● Mobile Millennium: traffic modeling
● Twitter: spam classification
● …

Expressing Other Models
● MapReduce, DryadLINQ
● Pregel graph processing
● Iterative MapReduce
● SQL

Conclusion
● RDDs are an efficient, general, and fault-tolerant abstraction for cluster computing
● Up to 20x faster than Hadoop for memory-bound applications
● Can be used for interactive data mining
● Available as open source at http://spark-project.org
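The claim that other models can be expressed on RDDs can be illustrated with MapReduce: it is a flatMap followed by a grouped reduce. Below is a sketch using plain Scala collections, whose flatMap/groupBy mirror the RDD flatMap/groupByKey operators; the mapReduce helper and the word-count input are illustrative, not part of the Spark API.

```scala
// Sketch: MapReduce expressed with RDD-style operators. Plain Scala
// collections stand in for RDDs here; on Spark the same shape would be
// flatMap followed by groupByKey/reduceByKey.
def mapReduce[A, K, V](
    input: Seq[A],
    mapper: A => Seq[(K, V)],      // user map function
    reducer: (V, V) => V           // user reduce function
): Map[K, V] =
  input
    .flatMap(mapper)               // RDD analogue: flatMap
    .groupBy(_._1)                 // RDD analogue: groupByKey
    .map { case (k, kvs) => k -> kvs.map(_._2).reduce(reducer) } // reduceByKey

// Word count, the canonical MapReduce example:
val counts = mapReduce[String, String, Int](
  Seq("to be", "or not to be"),
  line => line.split(" ").toSeq.map(w => (w, 1)),
  _ + _
)
// counts contains: "to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1
```

On a real cluster, reduceByKey additionally combines values map-side before shuffling; the groupBy-then-reduce sketch above does not capture that optimization.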