Hands-on Data Science and OSS
Driving Business Value with Open Source and Data Science
Kevin Crocker, @_K_C_Pivotal
#datascience, #oscon
Data Science Education Lead
Pivotal Software Inc. @pivotal
VM info
 Everything is 'oscon2014'
 User:password -> oscon2014:oscon2014
 PostgreSQL 9.2.8 dbname -> oscon2014
 Root password -> oscon2014
 Installed software: PostgreSQL 9.2.8, R, MADlib, PL/Pythonu, PL/pgSQL, Anaconda (/home/oscon2014/anaconda), pgAdmin3, RStudio, pyMADlib, and more to come in v1.1
Objective of Data Science
DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
What Matters: Apps. Data. Analytics.
Apps power businesses, and those apps generate data.
Analytic insights from that data drive new app functionality, which in turn drives new data.
The faster you can move around that cycle, the faster you learn, innovate, and pull away from the competition.
What Matters: OSS at the core
The same apps, data, and analytics cycle runs with open source software at its core.
End Game: Drive Business Value with OSS
Find interesting problems that can't easily be solved with current technology.
Use (find) the right tool for the job:
- If the right tools don't exist, create them
- Prefer OSS if it fits the need
- Drive business value through distributed, MPP analytics
- Operationalization (O16n) of your analytics
Create interesting solutions that drive business value.
PIVOTAL DATA SCIENCE TOOLKIT
1. Find Data
   Platforms: Pivotal Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS
2. Write Code
   Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime
   Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R
3. Run Code
   Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, Putty, Winscp
4. Write Code for Big Data
   In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
   Hadoop: HAWQ, Pig, Hive, Java
5. Implement Algorithms
   Libraries: MADlib
   Java: Mahout
   R: (too many to list!)
   Text: OpenNLP, NLTK, GPText
   C++: OpenCV
   Python: NumPy, SciPy, scikit-learn, Pandas
6. Show Results
   Visualization: python-matplotlib, python-networkx, D3.js, Tableau
   Programs: Alpine Miner, RStudio, MATLAB, SAS, Stata, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel
7. Collaborate
   Sharing Tools: Chorus, Confluence, Socialcast, GitHub, Google Drive & Hangouts
A large and varied tool box!
Toolkit?
This image was created by Swami Chandrasekaran, Enterprise Architect, IBM.
He has a great article about what it takes to be a Data Scientist: Road Map to Data Scientist
http://nirvacana.com/thoughts/becoming-a-data-scientist/
We need the right technology for every step.
Open Source At Pivotal
 Pivotal has a lot of open source projects and people involved in them (people counts in parentheses):
 PostgreSQL, Apache Hadoop (4)
 MADlib (16), PivotalR (2), pyMADlib (4), Pandas via SQL (3)
 Spring (56), Groovy (3), Grails (3)
 Apache Tomcat (2) and HTTP Server (1)
 Redis (1)
 RabbitMQ (4)
 Cloud Foundry (90)
 Open Chorus
 We use a combination of our commercial software and OSS to drive business value through Data Science.
Motivation
 Our story starts with SQL – so naturally we try to use SQL for everything! Everything?
 SQL is great for many things, but it's not nearly enough:
– Straightforward way to query data
– Not necessarily designed for data science
 Data Scientists know other languages: R, Python, ...
Our challenge
 MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL
 R / Python
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science
 Want to leverage both the performance benefits of MADlib and the
usability of languages like R and Python
How Pivotal Data Scientists Select Which Tool to Use
 Is the algorithm of choice available in MADlib or Chorus?
– Yes: Prototype in R/Python or directly in MADlib/Chorus, then build the final set of models in MADlib, Chorus, or PivotalR/PyMADlib (optimized for algorithm performance, scalability, and code overhead).
– No: Do opportunities for explicit parallelization exist?
   – Yes: Leverage Procedural Language extensions in Pivotal (i.e. PL/R, PL/Python).
   – No: Connect to Pivotal via ODBC.
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
MADlib
 MAD stands for: Magnetic, Agile, Deep
 lib stands for a library of:
– advanced (mathematical, statistical, machine learning)
– parallel & scalable
– in-database functions
 Mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development
MADlib: A Community Project
• Open Source: BSD License
• Developed as a partnership with multiple universities:
– University of California-Berkeley
– University of Wisconsin-Madison
– University of Florida
• Compatible with Postgres, Greenplum Database, and Hadoop via HAWQ
• Designed for Data Scientists to provide scalable, robust analytics capabilities for their business problems
• Homepage: http://madlib.net
• Documentation: http://doc.madlib.net
• Source: https://github.com/madlib
• Forum: http://groups.google.com/group/madlib-user-forum
MADlib: Architecture (layers, top to bottom)
• Machine Learning Algorithms: Generalized Linear Models, Linear Systems, Matrix Factorization
• Core Methods & Support Modules: Random Sampling, Sparse Vectors, Probability Functions, Array Operations, Linear Algebra
• C++ Database Abstraction Layer: Data Type Mapping, Exception Handling, Logging and Reporting, Memory Management, Boost Support
• Database Platform Layer: User Defined Functions, User Defined Types, User Defined Operators, User Defined Aggregates, OLAP Window Functions, OLAP Grouping Sets
MADlib: Diverse User Experience
SQL:
psql> SELECT madlib.linregr_train('abalone',
                                  'abalone_linregr',
                                  'rings',
                                  'array[1,diameter,height]');
psql> SELECT coef, r2 FROM abalone_linregr;
-[ RECORD 1 ]----------------------------------------------
coef | {2.39392531944631,11.7085575219689,19.8117069108094}
r2   | 0.350379630701758

Python (pyMADlib):
from pymadlib.pymadlib import *
conn = DBConnect()
mdl = LinearRegression(conn)
mdl.train(input_table, indepvars, depvar)
cursor = mdl.predict(input_table, depvar)
scatterPlot(actual, predicted, dataset)

Also accessible from R (PivotalR) and from Open Chorus.
MADlib In-Database Functions
Descriptive Statistics
• Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
• Correlation
• Summary

Predictive Modeling Library
Generalized Linear Models:
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Matrix Factorization:
• Singular Value Decomposition (SVD)
• Low-Rank
Machine Learning Algorithms:
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Linear Systems:
• Sparse and Dense Solvers

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
Calling MADlib Functions: Fast Training, Scoring
 MADlib allows users to easily create models without moving data out of the system:
– Model generation
– Model validation
– Scoring (evaluation of) new data
 All the data can be used in one model
 Built-in functionality to create multiple smaller models (e.g. regression/classification grouped by feature)
 Open source lets you tweak and extend methods, or build your own

MADlib model function:
SELECT madlib.linregr_train( 'houses',                    -- table containing training data
                             'houses_linregr',            -- table in which to save results
                             'price',                     -- column containing dependent variable
                             'ARRAY[1, tax, bath, size]', -- features included in the model
                             'bedroom');                  -- create multiple output models (one for each value of bedroom)
Calling MADlib Functions: Fast Training, Scoring
Train a single model, then score new data with it:
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');

-- MADlib model scoring function
SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size],
                              m.coef) AS predict
FROM houses,            -- table with data to be scored
     houses_linregr m;  -- table containing model
K-Means Clustering
Clustering refers to the problem of partitioning a set of
objects according to some problem-dependent
measure of similarity. In the k-means variant, given n
points x1,…,xn∈ℝd, the goal is to position k centroids
c1,…,ck∈ℝd so that the sum of distances between
each point and its closest centroid is minimized. Each
centroid represents a cluster that consists of all points
to which this centroid is closest.
So, we are trying to find the centroids which minimize
the total distance between all the points and the
centroids.
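In symbols, a standard way to write this objective (using the squared Euclidean distance, which also matches MADlib's usual default distance function):

\min_{c_1,\dots,c_k \in \mathbb{R}^d} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2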
K-means Clustering
Example Use Cases:
 Which blogs are spam blogs?
 Given a user's preferences, which other blog might she/he enjoy?
 What are our customers saying about us?

What are our customers saying about us?
 Discern trends and categories in on-line conversations:
- Search for relevant blogs
- 'Fingerprinting' based on word frequencies
- Similarity measure
- Identify 'clusters' of documents
What are our customers saying about us?
Method
• Construct document histograms
• Transform histograms into document “fingerprints”
• Use clustering techniques to discover similar
documents.
What are our customers saying about us?
Constructing document histograms
 Parsing & extracting HTML files
 Using natural language processing for tokenization and stemming
 Cleansing inconsistencies
 Transforming unstructured data into structured data
What are our customers saying about us?
"Fingerprinting"
- Term frequency of words within a document vs. the frequency that those words occur in all documents
- Term frequency-inverse document frequency (tf-idf weight)
- Easily calculated based on formulas over the document histograms (a sketch follows)
- The result is a vector in n-dimensional Euclidean space
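A minimal SQL sketch of the tf-idf calculation, assuming a hypothetical table term_counts(doc_id, term, cnt) built from the document histograms (the table and column names are illustrative, not from the talk):

WITH doc_len AS (                      -- total terms per document
    SELECT doc_id, sum(cnt) AS len
    FROM term_counts
    GROUP BY doc_id
), df AS (                             -- number of documents containing each term
    SELECT term, count(DISTINCT doc_id) AS n_docs_with_term
    FROM term_counts
    GROUP BY term
), n AS (                              -- total number of documents
    SELECT count(DISTINCT doc_id) AS n_docs FROM term_counts
)
SELECT t.doc_id,
       t.term,
       (t.cnt::float8 / d.len)                               -- term frequency
       * ln(n.n_docs::float8 / df.n_docs_with_term) AS tfidf -- inverse document frequency
FROM term_counts t
JOIN doc_len d USING (doc_id)
JOIN df USING (term)
CROSS JOIN n;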
K-Means Clustering – Training Function
The k-means algorithm can be invoked in four ways, depending on the
source of the initial set of centroids:
1. Use the random centroid seeding method.
2. Use the kmeans++ centroid seeding method.
3. Supply an initial centroid set in a relation identified by the rel_initial_centroids
argument.
4. Provide an initial centroid set as an array expression in the initial_centroids
argument.
Random Centroid seeding method
kmeans_random( rel_source,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
Kmeans++ centroid seeding method
kmeanspp( rel_source,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
Initial Centroid set in a relation
kmeans( rel_source,
expr_point,
rel_initial_centroids, -- this is the relation
expr_centroid,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
Initial centroid as an array
kmeans( rel_source,
expr_point,
initial_centroids, -- this is the array
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
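For instance, a hedged invocation of the kmeans++ seeding variant, assuming a hypothetical table doc_fingerprints(doc_id, features) whose features column holds the tf-idf vectors (the distance function and centroid aggregate shown are MADlib's usual choices):

SELECT * FROM madlib.kmeanspp(
    'doc_fingerprints',           -- rel_source
    'features',                   -- expr_point (a float8[] column)
    10,                           -- k: number of centroids
    'madlib.squared_dist_norm2',  -- fn_dist: squared Euclidean distance
    'madlib.avg',                 -- agg_centroid
    20,                           -- max_num_iterations
    0.001                         -- min_frac_reassigned
);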
K-Means Clustering – Cluster Assignment
After training, the cluster assignment for each data point can be computed with the help of the following function:
closest_column( m, x )
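A hedged sketch continuing the hypothetical example above: persist the k-means output, then label each point with the id of its closest centroid (madlib.closest_column returns a composite value with a column_id field):

CREATE TABLE km_model AS
SELECT * FROM madlib.kmeanspp(
    'doc_fingerprints', 'features', 10,
    'madlib.squared_dist_norm2', 'madlib.avg', 20, 0.001);

SELECT f.doc_id,
       (madlib.closest_column(m.centroids, f.features)).column_id AS cluster_id
FROM doc_fingerprints f, km_model m;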
Assessing the quality of the clustering
A popular method to assess the quality of the clustering is the silhouette
coefficient, a simplified version of which is provided as part of the k-means
module. Note that for large data sets, this computation is expensive.
The silhouette function has the following syntax:
simple_silhouette( rel_source,
expr_point,
centroids,
fn_dist
)
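Continuing the same hypothetical example, the coefficient could be computed as:

SELECT * FROM madlib.simple_silhouette(
    'doc_fingerprints',                 -- rel_source
    'features',                         -- expr_point
    (SELECT centroids FROM km_model),   -- centroids
    'madlib.squared_dist_norm2'         -- fn_dist
);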
What are our customers saying about us?
Example clusters of terms discovered:
 innovation
 leader
 design
• bug
• installation
• speed
• graphics
• improvement
• download
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
Pivotal & R Interoperability
 In a traditional analytics workflow using R:
–Datasets are transferred from a data source
–Modeled or visualized
–Model scoring results are pushed back to the data source
 Such an approach works well when:
–The amount of data can be loaded into memory, and
–The transfer of large amounts of data is inexpensive and/or fast
 PivotalR explores the situation involving large data sets where
these two assumptions are violated and you have an R
background
Enter PivotalR
 Challenge
Want to harness the familiarity of R’s interface and the performance &
scalability benefits of in-DB analytics
 Simple solution:
Translate R code into SQL
PivotalR Design Overview
1. R -> SQL: PivotalR translates R code into SQL (no data on the R side)
2. SQL to execute is shipped to the database via RPostgreSQL
3. Computation results come back to R
Data lives in the Database/Hadoop w/ MADlib and never has to move to R.
PivotalR Design Overview
 Call MADlib’s in-database machine learning functions
directly from R
 Syntax is analogous to native R functions – for example,
madlib.lm() mimics the syntax of the native lm() function
 Data does not need to leave the database
 All heavy lifting, including model estimation & computation,
is done in the database
PivotalR Design Overview
 Manipulate database tables directly from R without needing
to be familiar with SQL
 Perform the equivalent of SQL’s ‘select’ statements
(including joins) in a syntax that is similar to R’s
data.frame operations
 For example: R's 'merge' corresponds to SQL's 'join'
PivotalR: Current Functionality
MADlib Functionality:
• Linear Regression
• Logistic Regression
• Elastic Net
• ARIMA
• Marginal Effects
• Cross Validation
• Bagging
• summary on model objects
• Automated Indicator Variable Coding: as.factor
• predict
And more (SQL wrapper):
• Arithmetic: + - * / %% %/% ^
• Subsetting and assignment: $ [ [[ $<- [<- [[<-
• Comparison: == != > < >= <=
• Logical: & | !
• merge, by, sort, preview, content, is.na
• db.data.frame, as.db.data.frame
• Aggregates: c mean sum sd var min max length colMeans colSums
• dim, names
• Connection management: db.connect, db.disconnect, db.list, db.objects, db.existsObject, delete
http://github.com/gopivotal/PivotalR/
PivotalR Example
 Load the PivotalR package
– > library('PivotalR')
 Get help for a function
– > help(db.connect)
 Connect to a database
– > db.connect(host = "dca.abc.com", user = "student01", dbname = "studentdb", password = "studentpw", port = 5432, madlib = "madlib", conn.pkg = "RPostgreSQL", default.schemas = NULL)
 List connections
– > db.list()

PivotalR Example
 Connect to a table via the db.data.frame function (note that the data remains in the database and is not loaded into memory)
– > y <- db.data.frame("test.abalone", conn.id = 1, key = character(0), verbose = TRUE, is.temp = FALSE)
 Fit a linear regression model (one model for each gender) and display it
– > fit <- madlib.lm(rings ~ . - id | sex, data = y)
– > fit # view the result
 Apply the model to data in another table (i.e. x) and compute the mean squared error
– > lookat(mean((x$rings - predict(fit, x))^2))
PivotalR
 PivotalR is an R package you can download from CRAN.
- http://cran.r-project.org/web/packages/PivotalR/index.html
- Using RStudio, you can install it with: install.packages("PivotalR")
 GitHub has the latest, greatest code and features but is less stable.
- https://github.com/gopivotal/PivotalR
 R front end to PostgreSQL and PostgreSQL-based databases.
 R wrapper around MADlib, the open source library for in-database scalable analytics.
 Mimics regular R syntax for manipulating R's "data.frame".
 Provides R functionality for Big Data stored in-database or in Apache Hadoop.
 Demo code: https://github.com/gopivotal/PivotalR/wiki/Example
 Training Video: https://docs.google.com/file/d/0B9bfZ-YiuzxQc1RWTEJJZ2V1TWc/edit
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
PL/R on Pivotal
 Procedural Language (PL/X)
– X includes R, Python, pgSQL, Java, Perl, C, etc.
– needs to be installed on each database
 PL/R enables you to write PostgreSQL (and Greenplum DB) functions in the R language
 R is installed on each segment of the Pivotal cluster
 Parsimonious: R piggy-backs on Pivotal's parallel architecture
 Minimizes data movement
PL/R on Pivotal
 Allows most of R's capabilities. Basic guide: "PostgreSQL Functions by Example" http://www.joeconway.com/presentations/function_basics.pdf
 In PostgreSQL and GPDB/PHD, check which PL languages are installed in the database:

select * from pg_language;
  lanname  | lanispl | lanpltrusted | lanplcallfoid | lanvalidator | lanacl
-----------+---------+--------------+---------------+--------------+--------
 internal  | f       | f            |             0 |         2246 |
 c         | f       | f            |             0 |         2247 |
 sql       | f       | t            |             0 |         2248 |
 plpgsql   | t       | t            |         10885 |        10886 |
 plpythonu | t       | f            |         16386 |            0 |
 plr       | t       | f            |         18975 |            0 |
(6 rows)

 PL/R is an "untrusted" language: only database superusers have the ability to create UDFs with PL/R (see the "lanpltrusted" column in the pg_language table)
PL/R Example
 Consider the census dataset below (each row represents an individual):
– h_state = integer encoding which state they live in
– earns = their income
– hours = how many hours per week they work
– ... and other features
 Suppose we want to build a model of income for each state separately: SQL goes in, one model per state comes out
PL/R Example
 Prepare table for PL/R by converting it into array form
-- Create array version of table
DROP TABLE IF EXISTS use_r.census1_array_state;
CREATE TABLE use_r.census1_array_state AS(
SELECT
h_state::text h_state,
array_agg(h_serialno::float8) h_serialno, array_agg(earns::float8) earns,
array_agg(hours::float8) hours, array_agg((earns/hours)::float8) wage,
array_agg(hsdeg::float8) hsdeg, array_agg(somecol::float8) somecol,
array_agg(associate::float8) associate, array_agg(bachelor::float8) bachelor,
array_agg(masters::float8) masters, array_agg(professional::float8) professional,
array_agg(doctorate::float8) doctorate, array_agg(female::float8) female,
array_agg(rooms::float8) rooms, array_agg(bedrms::float8) bedrms,
array_agg(notcitizen::float8) notcitizen, array_agg(rentshouse::float8) rentshouse,
array_agg(married::float8) married
FROM use_r.census1
GROUP BY h_state
) DISTRIBUTED BY (h_state);
PL/R Example
Each state's data (TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA, ...) is processed in parallel with SQL & R, producing one model per state (TN Model, CA Model, NY Model, ...).
PL/R Example
 Run linear regression to predict income in each state
– Define the output data type
– Create the PL/R function: a SQL wrapper around a function body written in R (a hedged sketch follows)
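The slide's code is not reproduced in this transcript; the following is a minimal sketch of the pattern under the assumptions above (the type name, function name, and choice of three census1_array_state columns are illustrative):

-- Output type: one row per regression coefficient.
CREATE TYPE lm_state_type AS (
    variable text,
    coef     float8,
    stderr   float8,
    tstat    float8
);

-- PL/R function: the body between $$ ... $$ is ordinary R.
CREATE OR REPLACE FUNCTION lm_by_state(earns float8[],
                                       hours float8[],
                                       bachelor float8[])
RETURNS SETOF lm_state_type
AS $$
    df  <- data.frame(earns = earns, hours = hours, bachelor = bachelor)
    fit <- lm(earns ~ hours + bachelor, data = df)
    s   <- summary(fit)$coefficients
    data.frame(variable = rownames(s),
               coef     = s[, 1],
               stderr   = s[, 2],
               tstat    = s[, 3])
$$ LANGUAGE 'plr';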
PL/R Example
 Execute PL/R function
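Again as a hedged sketch, executing it over the array table built earlier fits one model per state, in parallel:

SELECT h_state,
       (lm_by_state(earns, hours, bachelor)).*
FROM use_r.census1_array_state;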
PL/R
 PL/R is not installed in the VM: you have to download the source and compile it
 Instructions can be found here: http://www.joeconway.com/plr/doc/plr-install.html
 CHALLENGE: Download PL/R, compile it, and install it in PostgreSQL
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
Pivotal & Python Interoperability
 In a traditional analytics workflow using Python:
–Datasets are transferred from a data source
–Modeled or visualized
–Model scoring results are pushed back to the data source
 Such an approach works well when:
–The amount of data can be loaded into memory, and
–The transfer of large amounts of data is inexpensive and/or fast
 pyMADlib explores the situation involving large data sets where
these two assumptions are violated and you have a Python
background
Enter pyMADlib
 Challenge
Want to harness the familiarity of Python’s interface and the
performance & scalability benefits of in-DB analytics
 Simple solution:
Translate Python code into SQL
pyMADlib Design Overview
1. Python -> SQL: pyMADlib translates Python calls into SQL (no data on the Python side)
2. SQL to execute is shipped to the database over ODBC/JDBC
3. Computation results and model output come back to Python
Data lives in the Database/Hadoop w/ MADlib.
 All data stays in the DB, and all model estimation and heavy lifting are done in the DB by MADlib
 Only strings of SQL and model output are transferred across ODBC/JDBC
 Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX, and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language: Python.
Hands-on Exploration
PyMADlib Tutorial – IPython Notebook Viewer Link:
http://nbviewer.ipython.org/5275846
Where do I get it?
$ pip install pymadlib
Pivotal, MADlib, R, and Python
 Pivotal & MADlib & R Interoperability
–PivotalR
–PL/R
 Pivotal & MADlib & Python Interoperability
–pyMADlib
–PL/Python
PL/Python on Pivotal
 Syntax is like a normal Python function, with the function definition line replaced by a SQL wrapper
 Alternatively: like a SQL User Defined Function with Python inside
 Name in SQL is plpythonu
– 'u' means untrusted, so you need to be a superuser to create functions

CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer
AS $$
    # normal Python inside the SQL wrapper
    if a > b:
        return a
    return b
$$ LANGUAGE plpythonu;
Returning Results
 Postgres primitive types (int, bigint, text, float8, double precision, date, NULL, etc.)
 Composite types can be returned by creating a composite type in the database:

CREATE TYPE named_value AS (
    name text,
    value integer
);

 Then you can return a list, tuple or dict (not sets) which references the same structure as the type:

CREATE FUNCTION make_pair (name text, value integer)
RETURNS named_value
AS $$
    return [ name, value ]
    # or alternatively, as tuple: return ( name, value )
    # or as dict: return { "name": name, "value": value }
    # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;

 For functions which return multiple rows, prefix "setof" before the return type
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator, or a generator:

Sequence:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;

Generator:
CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value AS $$
    for i in range(3):
        yield (name, i)
$$ LANGUAGE plpythonu;
Accessing Packages
 In an MPP environment: to be available, packages must be installed on every individual segment node.
– Can use a "parallel ssh" tool such as gpssh to conda/pip install them
 Then just import as usual inside the function:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    import numpy as np
    return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
Benefits of PL/Python
 Easy to bring your code to the data
 When SQL falls short, leverage your Python (or R/Java/C) experience quickly
 Apply Python across terabytes of data with minimal
overhead or additional requirements
 Results are already in the database system, ready for
further analysis or storage
Spring
What it is: an application framework, introduced as open source in 2003
Intention: build enterprise-class Java applications more easily
Outcomes:
1. Streamlined architecture, speeding application development by 2x and accelerating time to value
2. Portable, since Spring applications are identical for every platform and run across multiple app servers
Spring Ecosystem
http://spring.io
• XD: Stream, Taps, Jobs (http://projects.spring.io/spring-xd/)
• BOOT: Bootable, Minimal, Ops-Ready
• GRAILS: Full-stack, Web
• INTEGRATION: Channels, Adapters, Filters, Transformers
• BATCH: Jobs, Steps, Readers, Writers
• BIG DATA: Ingestion, Export, Orchestration, Hadoop
• WEB: Controllers, REST, WebSocket
• DATA: Relational, Non-Relational (http://projects.spring.io/spring-data/)
• CORE FRAMEWORK, plus SECURITY, GROOVY, REACTOR
Spring XD - Tackling Big Data Complexity
 One stop shop for a Big Data programming model:
– Data Ingestion
– Real-time Analytics
– Workflow Orchestration
– Data Export
 XD = 'eXtreme Data'
 Built on existing Spring assets: Spring Integration, Batch, Data
Sources (Files, Social, Mobile, Sensor, ...) flow through Ingest, Analytics, Compute, and Workflow to Export sinks (Redis, Gemfire, Jobs, RDBMS, HDFS, OLAP, ...).
Groovy and Grails
 Dynamic Language for the JVM
 Inspired by Smalltalk, Python, and Ruby
 Integrated with the Java language & platform at every level
"Cloud"
 Means many things to many people
 Distributed applications accessible over a network
 Typically, but not necessarily, the Internet
 An application and/or its platform
 Resources on demand
 Inherently virtualized
 Can run in-house (private cloud) as well
 Hardware and/or software sold as a commodity
Pivotal Speakers at OSCON 2014
10:40am Tuesday, Global Scaling at the New York Times using RabbitMQ, F150
Alvaro Videla (RabbitMQ), Michael Laing (New York Times)
Cloud
11:30am Tuesday, The Full Stack Java Developer, D136
Joshua Long (Pivotal), Phil Webb (Pivotal)
Java & JVM | JavaScript - HTML5 - Web
1:40pm Tuesday, A Recovering Java Developer Learns to Go, E142
Matt Stine (Pivotal)
Emerging Languages | Java & JVM
Pivotal Speakers at OSCON 2014
2:30pm Tuesday, Unicorns, Dragons, Open Source Business Models And Other Mythical
Creatures, PORTLAND BALLROOM, Main Stage
Andrew Clay Shafer (Pivotal)
11:30am Wednesday, Building a Recommendation Engine with Spring and Hadoop, D136
Michael Minella (Pivotal)
Java & JVM
1:40pm Wednesday, Apache Spark: A Killer or Savior of Apache Hadoop?, E143
Roman Shaposhnik (Pivotal)
Sponsored Sessions
Pivotal Speakers at OSCON 2014
10:00am Thursday, Developing Micro-services with Java and Spring, D139/140
Phil Webb (Pivotal)
Java & JVM | Tools & Techniques
11:00am Thursday, Apache HTTP Server; SSL from End-to-End, D136
William A Rowe Jr (Pivotal)
Security
Data Science At Pivotal
 Drive business value by operationalizing Data Science models using a combination of our
commercial software (based on open source) and open source software.
 Open Source is at the core of what we do
Thank You!
Kevin Crocker
[email protected]
@_K_C_Pivotal
Data Science Education Lead
BUILT FOR THE SPEED OF BUSINESS