Introduction to Spark
Matei Zaharia
Outline
The big data problem
Spark programming model
User community
Newest addition: DataFrames
The Big Data Problem
Data is growing faster than computation speeds
Growing data sources
» Web, mobile, scientific, …
Cheap storage
» Doubling every 18 months
Stalling CPU speeds
Examples
Facebook’s daily logs: 60 TB
1000 genomes project: 200 TB
Google web index: 10+ PB
Cost of 1 TB of disk: $30
Time to read 1 TB from disk: 6 hours (50 MB/s)
The Big Data Problem
A single machine can no longer process or even store all the data!
The only solution is to distribute over large clusters
Google Datacenter
How do we program this thing?
Traditional Network Programming
Message-passing between nodes
Really hard to do at scale:
» How to divide problem across nodes?
» How to deal with failures?
» Even worse: stragglers (node is not failed, but slow)
Almost nobody does this for “big data”
To Make Matters Worse
1) User time is also at a premium
» Many analyses are exploratory
2) Complexity of analysis is growing
» Unstructured data, machine learning, etc
Outline
The big data problem
Spark programming model
User community
Newest addition: DataFrames
What is Spark?
Fast and general engine that extends Google’s MapReduce model
High-level APIs in Java, Scala, Python, R
Collection of higher-level libraries
Spark Programming Model
Part of a family of data-parallel models
» Other examples: MapReduce, Dryad
Restricted API compared to message-passing:
“here’s an operation, run it on all the data”
» I don’t care where it runs (you schedule that)
» Feel free to run it twice on different nodes
Key Idea
Resilient Distributed Datasets (RDDs)
» Immutable collections of objects that can be stored in memory or on disk across a cluster
» Built with parallel transformations (map, filter, …)
» Automatically rebuilt on failure
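A minimal sketch of these ideas in the Scala API (the app name, data and predicate are illustrative, not from the talk):

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // An RDD built from a local collection; it is immutable and partitioned.
    val numbers = sc.parallelize(1 to 1000)

    // Transformations (map, filter, ...) lazily describe new RDDs.
    val evens = numbers.filter(_ % 2 == 0).map(_ * 10)

    // cache() keeps the RDD in memory once it has been computed.
    evens.cache()

    // Actions (count, collect, ...) trigger the parallel computation;
    // if a partition is lost, Spark rebuilds it from the transformations above.
    println(evens.count())

    sc.stop()
  }
}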
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                      // Base RDD
errors = lines.filter(s => s.startsWith("ERROR"))          // Transformed RDD
messages = errors.map(s => s.split('\t')(2))
messages.cache()

messages.filter(s => s.contains("foo")).count()            // Action
messages.filter(s => s.contains("bar")).count()
...

[Diagram: the driver ships tasks to workers; each worker scans its block of the file (Block 1-3), caches its partition of messages (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in 1 sec (vs 40 s for on-disk data)
Fault Tolerance
RDDs track lineage info to rebuild lost data
file.map(record => (record.type, 1))
    .reduceByKey((x, y) => x + y)
    .filter((type, count) => count > 10)

[Diagram: lineage graph: input file → map → reduce → filter; a lost partition is rebuilt by rerunning these steps on the corresponding block of input.]
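As a small sketch of how this lineage can be inspected from the shell, toDebugString prints the chain of transformations Spark would replay to rebuild a lost partition (the path and field layout here are illustrative):

// Assuming an existing SparkContext `sc`, e.g. in spark-shell.
val counts = sc.textFile("hdfs://.../events.log")
  .map(line => (line.split("\t")(0), 1))       // key each record by its type field
  .reduceByKey((x, y) => x + y)
  .filter { case (_, count) => count > 10 }

// Prints the lineage: the recipe used to recompute lost partitions.
println(counts.toDebugString)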
Example: Logistic Regression
Goal: find the best line separating two sets of points

[Diagram: two classes of points in the plane; a random initial line is moved iteratively toward the target separator.]
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = Vector.random(D)
for (i <- 1 to iterations) {
  gradient = data.map(p =>
    (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
  ).reduce((x, y) => x + y)
  w -= gradient
}
println("Final w: " + w)
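The snippet above is slide shorthand (Vector and readPoint are not defined). A self-contained sketch of the same loop, assuming points are stored one per line as "label f1 f2 ... fD" (file name, D and iteration count are illustrative):

import scala.math.exp
import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))
    val D = 10            // number of features (illustrative)
    val iterations = 20

    // Parse the points once and cache them so every iteration reads from memory.
    val data = sc.textFile("points.txt").map { line =>
      val nums = line.split(" ").map(_.toDouble)
      Point(nums.tail, nums.head)
    }.cache()

    var w = Array.fill(D)(math.random)
    for (_ <- 1 to iterations) {
      // Each point contributes (1 / (1 + exp(-y * w.x))) * y * x to the gradient.
      val gradient = data.map { p =>
        val margin = p.x.zip(w).map { case (xi, wi) => xi * wi }.sum
        val scale = (1.0 / (1.0 + exp(-p.y * margin))) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    println("Final w: " + w.mkString(", "))
    sc.stop()
  }
}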
Logistic Regression Results
[Chart: running time (s) vs. number of iterations (1-30). Hadoop: about 110 s per iteration. Spark: 80 s for the first iteration, about 1 s for further iterations.]
Demo
Higher-Level Libraries
Built on the Spark core:
» Spark SQL: structured data
» Spark Streaming: real-time
» MLlib: machine learning
» GraphX: graph processing
Combining Processing Types
// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.predict(t.location), 1))
  .reduceByWindow("5s", (a, b) => a + b)
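The snippet above is slide-level pseudocode (sc.twitterStream and reduceByWindow are abbreviated). A runnable sketch of just the first two steps in Scala, assuming a registered "tweets" table with the latitude/longitude columns from the snippet:

import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assuming an existing SparkContext `sc` and a registered "tweets" table.
val sqlContext = new SQLContext(sc)

// 1) Load data using SQL.
val points = sqlContext.sql("select latitude, longitude from tweets")
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))

// 2) Train a machine learning model: 10 clusters, 20 iterations.
val model = KMeans.train(points, 10, 20)

// The model can then be applied to new records, e.g. inside a streaming job.
model.clusterCenters.foreach(println)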
Outline
The big data problem
Spark programming model
User community
Newest addition: DataFrames
Spark Users
1000+ deployments, clusters up to 8000 nodes
Applications
Large-scale machine learning
Analysis of neuroscience data
Network security
SQL and data clustering
Trends & recommendations
Which Libraries Do People Use?
Spark SQL: 69%
DataFrames: 62%
Spark Streaming: 58%
MLlib + GraphX: 58%

75% of users use two or more components
50% use three or more components
Which Languages Are Used?
[Chart: languages used in the 2014 vs. 2015 surveys (percentage of users per language).]
Community Growth
[Chart: contributors per month to Spark, 2010-2015 (y-axis 0-160).]
Most active open source project in big data
Outline
The big data problem
Spark programming model
User community
Newest addition: DataFrames
Challenges with Functional API
Looks high-level, but hides many semantics of the computation from the engine
» Functions passed in are arbitrary blocks of code
» Data stored is arbitrary Java/Python objects
Users can mix APIs in suboptimal ways
Example Problem

pairs = data.map(word => (word, 1))
groups = pairs.groupByKey()            // materializes all groups as lists of integers
groups.map((k, vs) => (k, vs.sum))     // ...which are then promptly aggregated
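For comparison, a minimal sketch of what an experienced user would write instead: reduceByKey combines values per key as it goes (including within each partition before the shuffle), so the full groups above are never materialized. Here `data` is assumed to be an RDD of words.

// Assuming data: RDD[String] of words.
val pairs = data.map(word => (word, 1))

// Aggregates incrementally; no per-key lists are ever built.
val counts = pairs.reduceByKey((a, b) => a + b)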
Challenge: Data Representation
Java objects are often many times larger than the underlying data

class User(name: String, friends: Array[Int])
User("Bobby", Array(1, 2))

[Diagram: the User object holds pointers to a String and an int[]; the String in turn points to a char[] containing "Bobby"; each object adds its own header and length fields.]
DataFrames / Spark SQL
Efficient library for working with structured data
» Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs
» Optimized computation and storage underneath
Spark SQL added in 2014, DataFrames in 2015
Spark SQL Architecture
[Diagram: DataFrames and SQL queries become a Logical Plan; the Optimizer, consulting a Catalog, produces a Physical Plan; a Code Generator turns it into work executed as RDDs; the Data Source API connects to external data.]
DataFrame API
DataFrames hold rows with a known schema and offer relational operations through a DSL

c = HiveContext()
users = c.sql("select * from users")
ma_users = users[users.state == "MA"]     # builds an expression AST, not a Python predicate
ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda row: row.name.upper())
API Details
Based on the data frame concept in R and Python
» Spark is the first to make this declarative
Integrated with the rest of Spark
» ML library takes DataFrames as input & output
» Easily convert RDDs ↔ DataFrames (see the sketch below)

[Chart: Google Trends interest for “data frame” over time.]
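A minimal sketch of the RDD ↔ DataFrame round trip in the Scala API (class, fields and data are illustrative; toDF needs the SQLContext implicits in scope):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Assuming an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// RDD -> DataFrame: the case class supplies the schema by reflection.
val peopleRdd = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 21)))
val peopleDf = peopleRdd.toDF()

// DataFrame -> RDD of Rows.
val backToRows = peopleDf.rdd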
What DataFrames Enable
1. Compact binary representation
» Columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc.)
3. Runtime code generation
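One way to watch items 2 and 3 at work is explain(), which prints the analyzed, optimized and physical plans for a DataFrame (reusing the hypothetical peopleDf from the sketch above):

// The plan shows the projection and filter; for data sources that support it,
// the age predicate appears as a pushed-down filter.
val adults = peopleDf.filter(peopleDf("age") > 21).select("name")
adults.explain(true)   // extended = true prints both logical and physical plans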
Performance
[Chart: time for an aggregation benchmark (s, 0-10 scale), comparing DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python and RDD Scala.]
Data Sources
Uniform way to access structured data
» Apps can migrate across Hive, Cassandra, JSON, …
» Rich semantics allows query pushdown into data sources
[Diagram: a DataFrame expression such as users[users.age > 20] or a query such as "select * from users" goes through Spark SQL, which pushes work down into the underlying data source.]
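A sketch of the uniform reader/writer API (Spark 1.4-era; the formats and paths are illustrative):

// Assuming an existing SQLContext `sqlContext`.
// The same reader works across formats; only the format name and path change.
val usersJson = sqlContext.read.format("json").load("users.json")

// Writing is symmetric, so migrating data between sources is one line:
usersJson.write.format("parquet").save("users.parquet")
val usersParquet = sqlContext.read.format("parquet").load("users.parquet")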
Examples

JSON (tweets.json):
{
  "text": "hi",
  "user": {
    "name": "bob",
    "id": 15 }
}
select user.id, text from tweets

JDBC:
select age from users where lang = "en"

Together:
select t.text, u.age
from tweets t, users u
where t.user.id = u.id
  and u.lang = "en"

[Diagram: for the combined query, Spark SQL pushes "select id, age from users where lang = 'en'" down into the JDBC source and reads the tweets from the JSON file itself.]
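A hedged sketch of running the combined query from Scala; the JDBC URL, table names and file path are assumptions, and whether the lang predicate is actually pushed into the database depends on the source:

import java.util.Properties
// Assuming an existing SQLContext `sqlContext`.

// Register the JSON file as a table.
sqlContext.read.json("tweets.json").registerTempTable("tweets")

// Register the JDBC table; Spark SQL can push column pruning and simple
// predicates (e.g. "select id, age from users where lang = 'en'") into it.
sqlContext.read
  .jdbc("jdbc:postgresql://db.example.com/app", "users", new Properties())
  .registerTempTable("users")

// The combined query from the slide.
val joined = sqlContext.sql(
  """select t.text, u.age
    |from tweets t, users u
    |where t.user.id = u.id and u.lang = 'en'""".stripMargin)
joined.show()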
To Learn More
Get Spark at spark.apache.org
» You can run it on your laptop in local mode
Tutorials, MOOCs and news:
sparkhub.databricks.com
Use cases: spark-summit.org