Introduction to Spark
Matei Zaharia

Outline
» The big data problem
» Spark programming model
» User community
» Newest addition: DataFrames

The Big Data Problem
Data is growing faster than computation speeds:
» Growing data sources: web, mobile, scientific, …
» Cheap storage, doubling every 18 months
» Stalling CPU speeds

Examples
» Facebook's daily logs: 60 TB
» 1000 Genomes Project: 200 TB
» Google web index: 10+ PB
» Cost of 1 TB of disk: $30
» Time to read 1 TB from disk: 6 hours (50 MB/s)

The Big Data Problem
A single machine can no longer process, or even store, all the data!
The only solution is to distribute it over large clusters.

Google Datacenter
How do we program this thing?

Traditional Network Programming
Message-passing between nodes is really hard to do at scale:
» How do you divide the problem across nodes?
» How do you deal with failures?
» Even worse: stragglers (a node that has not failed, but is slow)
Almost nobody does this for "big data".

To Make Matters Worse
1) User time is also at a premium: many analyses are exploratory
2) Complexity of analysis is growing: unstructured data, machine learning, etc.

Outline
» The big data problem
» Spark programming model
» User community
» Newest addition: DataFrames

What is Spark?
A fast and general engine that extends Google's MapReduce model:
» High-level APIs in Java, Scala, Python, R
» Collection of higher-level libraries

Spark Programming Model
Part of a family of data-parallel models (other examples: MapReduce, Dryad).
Restricted API compared to message-passing: "here's an operation, run it on all the data"
» I don't care where it runs (you schedule that)
» Feel free to run it twice on different nodes

Key Idea: Resilient Distributed Datasets (RDDs)
» Immutable collections of objects that can be stored in memory or on disk across a cluster
» Built with parallel transformations (map, filter, …)
» Automatically rebuilt on failure

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")                    // base RDD
    errors = lines.filter(s => s.startsWith("ERROR"))       // transformed RDD
    messages = errors.map(s => s.split('\t')(2))
    messages.cache()

    messages.filter(s => s.contains("foo")).count()         // action
    messages.filter(s => s.contains("bar")).count()

[Diagram: the driver ships tasks to workers; each worker scans its block of the file once, then serves later queries from its in-memory cache of messages.]

Result: full-text search of Wikipedia in 1 sec (vs 40 s for on-disk data)
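The snippet above is slide-level pseudocode. A minimal self-contained sketch of the same pipeline in Spark's Scala API, assuming local mode; the path /tmp/app.log and the tab-separated field layout are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        // Run on all local cores; no cluster needed for a sketch like this.
        val conf = new SparkConf().setAppName("LogMining").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("/tmp/app.log")               // hypothetical log file
        val errors = lines.filter(_.startsWith("ERROR"))      // lazy transformation
        val messages = errors.map(_.split('\t')(2))           // extract message field
        messages.cache()                                      // keep in memory once computed

        // The first action reads the file and populates the cache;
        // the second reuses the in-memory copy.
        println(messages.filter(_.contains("foo")).count())
        println(messages.filter(_.contains("bar")).count())

        sc.stop()
      }
    }

The first count pays the full disk-read cost; later queries hit memory, which is what the Wikipedia numbers above illustrate.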
Fault Tolerance
RDDs track lineage info to rebuild lost data:

    file.map(record => (record.type, 1))
        .reduceByKey((x, y) => x + y)
        .filter((type, count) => count > 10)

[Diagram: lineage graph, input file to map to reduce to filter; a lost partition is rebuilt by replaying these steps on its inputs.]

Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Figure: two point clouds in a plane; a random initial line converges to the target line.]

    data = spark.textFile(...).map(readPoint).cache()
    w = Vector.random(D)

    for (i <- 1 to iterations) {
      gradient = data.map(p =>
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
      ).reduce((x, y) => x + y)
      w -= gradient
    }

    println("Final w: " + w)

Logistic Regression Results
[Chart: running time (s) vs number of iterations (1 to 30). Hadoop: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for further iterations.]

Demo

Higher-Level Libraries
» Spark SQL: structured data
» Spark Streaming: real-time
» MLlib: machine learning
» GraphX: graph processing

Combining Processing Types

    // Load data using SQL
    points = ctx.sql("select latitude, longitude from tweets")

    // Train a machine learning model
    model = KMeans.train(points, 10)

    // Apply it to a stream
    sc.twitterStream(...)
      .map(t => (model.predict(t.location), 1))
      .reduceByWindow("5s", (a, b) => a + b)

Outline
» The big data problem
» Spark programming model
» User community
» Newest addition: DataFrames

Spark Users
1000+ deployments, clusters up to 8000 nodes.

Applications
» Large-scale machine learning
» Analysis of neuroscience data
» Network security
» SQL and data clustering
» Trends & recommendations

Which Libraries Do People Use?
» Spark SQL: 69%
» DataFrames: 62%
» Spark Streaming: 58%
» MLlib + GraphX: 58%
75% of users use two or more components; 50% use three or more.

Which Languages Are Used?
[Chart: languages used in the 2014 and 2015 surveys; the values 84%, 71%, 58%, 38%, 38%, 31%, 18% appear, but the per-language labels were lost in extraction.]

Community Growth
[Chart: contributors per month to Spark, 2010 to 2015, growing from near zero to roughly 140.]
Most active open source project in big data.

Outline
» The big data problem
» Spark programming model
» User community
» Newest addition: DataFrames

Challenges with Functional API
Looks high-level, but hides many semantics of the computation from the engine:
» Functions passed in are arbitrary blocks of code
» Data stored is arbitrary Java/Python objects
Users can mix APIs in suboptimal ways.

Example Problem

    pairs = data.map(word => (word, 1))
    groups = pairs.groupByKey()           // materializes all groups as lists of integers
    groups.map((k, vs) => (k, vs.sum))    // then promptly aggregates them
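The groupByKey version above ships every value across the network and builds each group as a full list before summing. A minimal sketch of the standard fix with reduceByKey; the function name wordCount and the RDD[String] input type are assumptions for illustration:

    import org.apache.spark.rdd.RDD

    // Same computation, but partial sums are merged within each partition
    // before the shuffle, so no group is ever materialized as a list.
    def wordCount(data: RDD[String]): RDD[(String, Int)] =
      data.map(word => (word, 1))
          .reduceByKey((a, b) => a + b)

Only one partial sum per key per partition crosses the network, rather than every occurrence of every word.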
Challenge: Data Representation
Java objects are often many times larger than the underlying data:

    class User(name: String, friends: Array[Int])
    User("Bobby", Array(1, 2))

[Diagram: the two-field User object becomes a graph of pointers: a String object and char[] holding "Bobby", plus an int[] for the friends, with object headers and pointers dwarfing the actual values.]

DataFrames / Spark SQL
Efficient library for working with structured data:
» Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs
» Optimized computation and storage underneath
Spark SQL added in 2014, DataFrames in 2015.

Spark SQL Architecture
[Diagram: DataFrames and SQL both compile to a logical plan; the optimizer produces a physical plan, and a code generator emits operations on RDDs. A catalog and the Data Source API feed the planner.]

DataFrame API
DataFrames hold rows with a known schema and offer relational operations through a DSL:

    c = HiveContext()
    users = c.sql("select * from users")

    ma_users = users[users.state == "MA"]      // expression AST, not an opaque closure
    ma_users.count()
    ma_users.groupBy("name").avg("age")
    ma_users.map(lambda row: row.user.upper())

API Details
Based on the data frame concept in R and Python:
» Spark is the first to make this declarative
Integrated with the rest of Spark:
» ML library takes DataFrames as input & output
» Easily convert RDDs ↔ DataFrames
[Figure: Google Trends for "data frame".]

What DataFrames Enable
1. Compact binary representation: columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc.)
3. Runtime code generation

Performance
[Chart: time for an aggregation benchmark, 0 to 10 s. The DataFrame versions in SQL, R, Python, and Scala all run in roughly the same time; the Python RDD version is several times slower than the Scala RDD version.]

Data Sources
Uniform way to access structured data:
» Apps can migrate across Hive, Cassandra, JSON, …
» Rich semantics allows query pushdown into data sources, e.g. a filter such as users[users.age > 20] can be pushed into the source as SQL ("select * from users where age > 20")

Spark SQL Examples
A JSON record in tweets.json:

    { "text": "hi", "user": { "name": "bob", "id": 15 } }

JSON:

    select user.id, text from tweets

JDBC:

    select age from users where lang = "en"

Together:

    select t.text, u.age
    from tweets t, users u
    where t.user.id = u.id and u.lang = "en"

[Diagram: Spark SQL plans the joined query, pushing "select id, age from users where lang = 'en'" into the JDBC source while reading the JSON file directly.]

To Learn More
» Get Spark at spark.apache.org (you can run it on your laptop in local mode)
» Tutorials, MOOCs and news: sparkhub.databricks.com
» Use cases: spark-summit.org
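As a hedged sketch of what local mode looks like, assuming a downloaded and unpacked Spark distribution; the toy job is invented for illustration:

    // Start the interactive shell from the unpacked distribution:
    //   ./bin/spark-shell --master "local[*]"
    // Inside the shell a SparkContext is predefined as `sc`:
    val nums = sc.parallelize(1L to 1000000L)   // distribute a local range
    println(nums.map(_ * 2).reduce(_ + _))      // runs on all local cores

The same code runs unchanged on a cluster by pointing --master at a cluster manager instead of local[*].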