UC Berkeley
Using Statistical Machine Learning
in Cloud Computing
David Patterson, UC Berkeley
Reliable Adaptive Distributed Systems Lab
1
Image: John Curley http://www.flickr.com/photos/jay_que/1834540/
Datacenter is new “server”
• “Program” == Web search, email, map/GIS, …
• “Computer” == 1000’s computers, storage, network
• Warehouse-sized facilities and workloads
• New datacenter ideas (2007-2008): truck container (Sun),
floating (Google), datacenter-in-a-tent (Microsoft)
• How to enable innovation in new services without first
building & capitalizing a large company?
2
photos: Sun Microsystems & datacenterknowledge.com
Outline
• Cloud Computing
• RAD Lab
• Case Study 1: Director & SCADS
– Cloud storage and automatic management
• Case Study 2: Automatic Log Analysis
– Find anomalous behavior
• Case Study 3: Predicting Map Reduce
jobs
– Help with scheduling
• Summary
3
Cloud Computing is Hot
• 9/15/09 Federal CIO Vivek Kundra
embraces Cloud Computing
• “We’ve been building data center after
data center, acquiring application after
application, and frankly what it’s done is
drive up the cost of technology immensely
across the board. What we need is to find
a more innovative path in addressing
these problems.”
4
But...
What is cloud computing,
exactly?
5
“It’s nothing (new)”
“...we’ve redefined Cloud Computing to
include everything that we already do...
I don’t understand what we would do
differently ... other than change the
wording of some of our ads.”
Larry Ellison, CEO, Oracle (Wall Street
Journal, Sept. 26, 2008)
6
Above the Clouds:
A Berkeley View of Cloud Computing
abovetheclouds.cs.berkeley.edu
• 2/09 White paper by RAD Lab PI’s and students
– Clarify terminology around Cloud Computing
– Quantify comparison with conventional computing
– Identify Cloud Computing challenges & opportunities
• Why can we offer new perspective?
– Strong engagement with industry
– Users of cloud computing in our own research and
teaching in last 18 months
• Goal: stimulate discussion on what’s really new
– without resorting to weather analogies ad nauseam
7
Utility Computing Arrives
• Amazon Elastic Compute Cloud (EC2)
• “Compute unit” rental: $0.10-0.80/hr.
– 1 CU ≈ 1.0-1.2 GHz 2007 AMD Opteron/Xeon core
“Instances”:
Instance            Platform   Cores   Memory    Disk
Small  - $0.10/hr   32-bit     1       1.7 GB    160 GB
Large  - $0.40/hr   64-bit     4       7.5 GB    850 GB – 2 spindles
XLarge - $0.80/hr   64-bit     8       15.0 GB   1690 GB – 3 spindles
• No up-front cost, no contract, no minimum
• Billing rounded to nearest hour; pay-as-you-go
storage also available
• A new paradigm (!) for deploying services?
8
What is it? What’s new?
• Old idea: Software as a Service (SaaS)
– Basic idea predates MULTICS
– Software hosted in the infrastructure vs. installed on local
servers or desktops; dumb (but brawny) terminals
• New: pay-as-you-go utility computing
– Illusion of infinite resources on demand
– Fine-grained billing: release == don’t pay
– Earlier examples: Sun, Intel Computing Services—longer
commitment, more $$$/hour, no storage
– Public (utility) vs. private clouds
9
Why Now (not then)?
• “The Web Space Race”: Build-out of extremely
large datacenters (10,000’s of commodity PCs)
– Build-out driven by growth in demand (more users)
=> Infrastructure software: e.g., Google File System
=> Operational expertise: failover, DDoS, firewalls...
– Discovered economy of scale: 5-7x cheaper than
provisioning a medium-sized (100’s machines) facility
• More pervasive broadband Internet
• Commoditization of HW & SW
– Fast Virtualization
– Standardized software stacks
10
Classifying Clouds
• Instruction Set VM (Amazon EC2)
• Managed runtime VM (Microsoft Azure)
• Framework VM (Google AppEngine)
• Tradeoff: flexibility/portability vs. “built in” functionality
[Spectrum: lower-level, less managed (EC2) → higher-level, more managed (Azure, AppEngine)]
11
Cloud Economics 101
• Cloud Computing User: Static provisioning
for peak - wasteful, but necessary for SLA
[Figure: machines vs. time in a “statically provisioned” data center (flat capacity above varying demand; the gap is unused resources) vs. $ vs. time in a “virtual” data center in the cloud (capacity tracks demand)]
12
Cloud Economics 101
• Cloud Computing Provider: Could save energy
[Figure: machines vs. time in a “statically provisioned” data center vs. energy vs. time in a real data center in the cloud; capacity tracks demand, shrinking unused resources]
13
Risk of Under Utilization
• Underutilization results if “peak” predictions
are too optimistic
[Figure: resources vs. time for a static data center; flat capacity sits above varying demand, and the gap is unused resources]
14
Risks of Under Provisioning
[Figure: three resources vs. time (days) panels for a static data center; when demand peaks exceed provisioned capacity, the excess is first lost revenue and, over time, lost users]
15
New Scenarios Enabled by
“Risk Transfer” to Cloud
• “Cost associativity”: 1,000 CPUs for 1 hour costs the same
as 1 CPU for 1,000 hours (@ $0.10/hour)
– Washington Post converted Hillary Clinton’s travel
documents to post on WWW <1 day after released
– RAD Lab graduate students demonstrate improved
Hadoop (batch job) scheduler—on 1,000 servers
• Major enabler for SaaS startups
– Animoto traffic doubled every 12 hours for 3 days when
released as Facebook plug-in
– Scaled from 50 to >3500 servers
– ...then scaled back down
16
Cloud Computing &
Statistical Machine Learning?
• Before CC, only performance optimization
on mostly small scale systems
• CC ⇒ detailed cost-performance model
– Optimization more difficult with more metrics
• CC ⇒ everyone can use 1000+ servers
– Optimization more difficult at large scale
• Economics rewards scale up AND down
– Optimization more difficult if add/drop servers
• ⇒ Use SML as optimization difficulty increases
17
RAD Lab 5-year Mission
Enable 1 person to develop, deploy, operate
next-generation Internet application
• Key enabling technology: Statistical machine learning
– management, scaling, anomaly detection, performance
prediction, ...
• Highly interdisciplinary faculty & students
– PI’s: Fox/Katz/Patterson (systems/networks), Jordan (machine
learning), Stoica (networks & P2P), Joseph (systems/security),
Franklin (databases)
– 2 postdocs, ~30 PhD students, ~5 undergrads
18
RAD Lab Prototype:
System Architecture
[Architecture diagram: new apps, equipment, and global policies (e.g. SLAs) feed the Director, which acts through Drivers; Automatic Workload Evaluation (AWE) supplies offered load, resource utilization, etc.; Log Mining produces training data for performance & cost models; Chukwa & X-Trace provide monitoring; Web 2.0 apps run in a Ruby on Rails environment with web-service APIs on top of SCADS; each node runs Chukwa trace collection and local OS functions over a VM monitor]
19
Successes
1. Automatically add/drop servers to fit demand,
without violating Service Level Agreement (SLA)
2. Predict performance of complex software
system when demand is scaled up
3. Distill millions of lines of log messages into an
operator-friendly “decision tree” that pinpoints
“unusual” incidents/conditions
• Recurring theme: cutting-edge Statistical
Machine Learning (SML) works where simpler
methods have failed
20
Outline
• Cloud Computing
• RAD Lab
• Case Study 1: Director & SCADS
– Cloud storage and automatic management
• Case Study 2: Automatic Log Analysis
– Find anomalous behavior
• Case Study 3: Predicting Map Reduce
jobs
– Help with scheduling
• Summary
21
Automatic Management
of a Datacenter
• As datacenters grow, need to automatically
manage the applications and resources
– examples:
• deploy applications
• change configuration, add/remove virtual machines
• recover from failures
• Director:
– mechanism for executing datacenter actions
• Advisors:
– intelligence behind datacenter management
22
Director Framework
[Diagram: Advisors use workload and performance models plus monitoring data to issue actions to the Director; the Director, via Drivers, applies configuration changes to datacenter(s) of VMs]
23
Director Framework
• Director
– issues low-level/physical actions to the
DC/VMs
• request a VM, start/stop a service
– manage configuration of the datacenter
• list of applications, VMs, …
• Advisors
– update performance, utilization metrics
– use workload, performance models
– issue logical actions to the Director
• start an app, add 2 app servers
24
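To make the Advisor/Director split concrete, here is a minimal sketch in Python. It is not the RAD Lab code; every name in it (Advisor, Director, LogicalAction, the SLA rule) is an illustrative assumption.

# Minimal sketch of the Advisor/Director split described above.
# All names are illustrative, not the actual RAD Lab interfaces.
from dataclasses import dataclass

@dataclass
class LogicalAction:
    kind: str          # e.g. "add_app_servers", "remove_app_servers"
    app: str
    count: int

class SLAAdvisor:
    """Issues logical actions based on monitoring data and a performance model."""
    def __init__(self, latency_slo_ms: float):
        self.latency_slo_ms = latency_slo_ms

    def advise(self, app: str, p99_latency_ms: float, utilization: float):
        if p99_latency_ms > self.latency_slo_ms:
            return LogicalAction("add_app_servers", app, count=2)
        if utilization < 0.2:
            return LogicalAction("remove_app_servers", app, count=1)
        return None

class Director:
    """Turns logical actions into low-level VM requests and tracks configuration."""
    def __init__(self):
        self.config = {}   # app -> number of VMs

    def execute(self, action: LogicalAction):
        n = self.config.get(action.app, 0)
        if action.kind == "add_app_servers":
            # a real Director would call a cloud driver here to request VMs
            self.config[action.app] = n + action.count
        elif action.kind == "remove_app_servers":
            self.config[action.app] = max(0, n - action.count)

# One step of the control loop:
director = Director()
advisor = SLAAdvisor(latency_slo_ms=100)
action = advisor.advise("web-app", p99_latency_ms=180, utilization=0.9)
if action:
    director.execute(action)
print(director.config)   # {'web-app': 2}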
What About Storage?
• Easy to imagine how to scale up and scale
down computation
• Databases don’t scale down, and usually run
into limits when scaling up
• What would it mean to have datacenter
storage that could scale up and down as
well so as to save money for storage in
idle times?
25
DC Storage Motivation
• Most popular websites follow the same
pattern
– Rapidly developed on SQLServer /
PostgreSQL / MySQL / etc.
– Become popular and realize scaling
limitations
– Build large, complicated ad-hoc systems to
deal with scaling limitations as they arise
• Websites that can’t scale fast enough lose
customers
SCADS: Scalable, Consistency-Adjustable Data Storage
• Scale Independence - as the user base
grows:
– No changes to application
– Cost per user doesn’t increase
– Request latency doesn’t change
• Key Innovations
1. Performance-safe query language
2. Declarative performance/consistency tradeoffs
3. Automatic scale up and down using machine learning
Scale Independence Arch
• Developers provide performance-safe queries along with consistency requirements
• Use ML, workload information, and requirements to provision proactively via repartitioning keys and replicas
SCADS Performance Model
(on m1.small, all data in memory)
[Plot: latency vs. workload with 5% and 1% writes, showing median and 99th-percentile latency against the SLA threshold; panel: low workload, low put rate]
Outline
• Cloud Computing
• RAD Lab
• Case Study 1: Director & SCADS
– Cloud storage and automatic management
• Case Study 2: Automatic Log Analysis
– Find anomalous behavior
• Case Study 3: Predicting Map Reduce
jobs
– Help with scheduling
• Summary
31
Mining Console Logs for Large-Scale System Problem Detection
• Why console logs?
• Console logs are everywhere
– In almost every software system
– Hand-picked information by developers
– Expressive, convenient to use
• Low programmer overhead
• Especially useful in large scale Internet services
– Open source code + in-house development
– Continuously changing system
“Detecting Large-Scale System Problems by Mining Console Logs,”
Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan,
Symposium on Operating Systems Principles (SOSP), Oct 2009
32
Console logs use awkward
languages
HODIE NATUS EST RADICI FRATER*
today unto the Root a brother is born.
* "that crazy Multics error message in Latin."
http://www.multicians.org/hodie-natus-est.html
33
Console logs are not
operator friendly
[Diagram: operators attack console logs with grep, Perl scripts, and search]
• Problem – Don’t know what to look for!
• Console logs are intended for a single developer
• Assumption: log writer == log reader
• Today many developers => massive textual logs
• Our goal - Discover the most interesting log
messages without any prior input
34
Console logs are hard for
machines too
[Pipeline: Parsing → Feature Creation → Machine Learning → Visualization]
• Problem
• Highly unstructured, looks like free text
• Text alone does not capture detailed program state
• Hard for operators to understand detection results
• Our contribution
• A general framework for processing console logs
• Efficient parsing and features
35
Key observations
• Parsing
– Console logs are inherently structured
• Determined by log printing statement
– Constant strings = markers of message structure
– Source code is usually available
• Features
• Common information reported in logs
• Identifiers and execution paths
• State variables
• Correlations among messages
• Many ways to group related log messages
• i.e. not just by time
36
Case studies
• Surveyed a number of software apps
– Linux kernel, OpenSSH, Apache, MySQL, Jetty
– Hadoop, Cassandra, Nutch …
– Parsing works on 20/22
• In this talk – Hadoop file system
– Per block operation logging (usually ignored)
• Experiment on EC2 cloud
– 203 nodes * 48 hours
– ~ 300 TB HDFS data (550,000 blocks)
– ~24 million lines of console logs
37
Step 1: Using source code analysis to help
log parsing
• Basic ideas
091012 03-00-00 INFO Creating file mydata
Log.info(“Creating file ” + filename);
Creating file (.*) [filename]
• Non-trivial in object oriented languages
– toString() method of objects
– Dynamic binding (sub-classing)
• Our Solution
– Analyze source code to abstract syntax tree (AST) level
– Pre-compute all possible message types
• Free text -> semistructured text
38
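A minimal sketch of the parsing idea on the “Creating file” example above, in Python. It assumes the message template has already been recovered from the log-printing statement; the real system derives such templates from the source code’s abstract syntax tree, which this sketch does not do.

import re

# Template recovered from the source statement Log.info("Creating file " + filename):
# the constant string is the marker of the message structure,
# the variable part becomes a named group.
TEMPLATES = [
    ("CreatingFile", re.compile(r"Creating file (?P<filename>\S+)")),
]

def parse(line: str):
    """Turn a free-text console log line into (message_type, variables)."""
    for msg_type, pattern in TEMPLATES:
        m = pattern.search(line)
        if m:
            return msg_type, m.groupdict()
    return None, {}

print(parse("091012 03-00-00 INFO Creating file mydata"))
# -> ('CreatingFile', {'filename': 'mydata'})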
Step 2: Feature Message count vector
• Identifiers are common in logs
– file names, object keys, user ids
• Identifiers can be discovered automatically
– Have many distinct values
– Appear in multiple message types
– Reported many times
• Grouping by identifiers
– Similar to execution paths (e.g. myfile: Creating → Write → Backup → Done)
• Numerical representation of these “paths”
– Similar to Bag of Words model in IR
39
Message count vector example
datanode_r16 | Receiving block blk_100 src: … dest:...
namenode_r10 | allocateBlock: blk_100
namenode_r10 | allocateBlock: blk_200
datanode_r16 | Receiving block blk_200 src: … dest:...
datanode_r14 | Receiving block blk_100 src: …dest:…
datanode_r16 | Received block blk_100 of size 49486737 from …
datanode_r14 | Received block blk_100 of size 49486737 from …
datanode_r16 | Error Receiving block blk_200 of size 49486737 from …
blk_100: [0 2 2 1 2 0 0 2 0 0 0 0 0 0 0 0]
blk_200: [0 1 0 1 2 0 0 2 0 1 0 0 0 0 0 0]
40
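A sketch of how such message count vectors could be built: group parsed messages by their block identifier and count each message type. The message types and grouping rule here are just the ones visible in the example above, and the substring matching is a simplification of parsing against full templates.

from collections import Counter, defaultdict
import re

MESSAGE_TYPES = ["allocateBlock", "Receiving block", "Received block", "Error Receiving block"]

log = [
    "datanode_r16 | Receiving block blk_100 src: ... dest: ...",
    "namenode_r10 | allocateBlock: blk_100",
    "namenode_r10 | allocateBlock: blk_200",
    "datanode_r16 | Receiving block blk_200 src: ... dest: ...",
    "datanode_r14 | Receiving block blk_100 src: ... dest: ...",
    "datanode_r16 | Received block blk_100 of size 49486737 from ...",
    "datanode_r14 | Received block blk_100 of size 49486737 from ...",
    "datanode_r16 | Error Receiving block blk_200 of size 49486737 from ...",
]

# One Counter of message types per block identifier.
counts = defaultdict(Counter)
for line in log:
    block = re.search(r"blk_\d+", line).group()
    # match the most specific message type first
    for msg_type in sorted(MESSAGE_TYPES, key=len, reverse=True):
        if msg_type in line:
            counts[block][msg_type] += 1
            break

# Fixed-order message count vector per block, ready for PCA.
for block, c in counts.items():
    print(block, [c[t] for t in MESSAGE_TYPES])
# blk_100 [1, 2, 2, 0]
# blk_200 [1, 1, 0, 1]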
Step 3: Mining
PCA detection
[0 2 2 1 2 0 0 2 0 1 0 0 0 0 0 0]
55,000 vectors,
one per block
• This example - detect abnormal vectors
• Dimensions highly correlated
– Rare correlations indicate abnormal executions
– PCA separates normal pattern from abnormal,
making anomalies easy to detect
• From text -> structured feature
– With details of program execution
– Can apply many other machine learning
algorithms
41
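A minimal sketch of PCA-based detection on such count vectors, using scikit-learn on toy data. The percentile threshold is a stand-in assumption; the paper uses a squared-prediction-error (Q-statistic) test.

import numpy as np
from sklearn.decomposition import PCA

# rows = message count vectors, one per block (toy data: last row is anomalous)
X = np.array([
    [1, 2, 2, 0],
    [2, 4, 4, 0],
    [3, 6, 6, 0],
    [4, 8, 8, 0],
    [1, 1, 0, 1],
])

# Fit the "normal" low-dimensional subspace.
pca = PCA(n_components=1)
pca.fit(X)

# Residual = part of each vector not explained by the normal subspace.
reconstructed = pca.inverse_transform(pca.transform(X))
spe = np.sum((X - reconstructed) ** 2, axis=1)   # squared prediction error

threshold = np.percentile(spe, 95)               # simplistic threshold choice
print(spe, spe > threshold)                      # only the last block stands out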
PCA detection results
Seq  Event Description                                                         Actual   Detected
1    Forgot to update namenode for deleted block                                 4297     4297
2    Write block exception then client give up                                   3225     3225
3    Failed at beginning, no block written                                       2950     2950
4    Over-replicate-immediately-deleted                                          2809     2788
5    Received block that does not belong to any file                             1240     1228
6    Redundant addStoredBlock request received                                    953      953
7    Trying to delete a block, but the block no longer exists on data node        724      650
8    Empty packet for block                                                       476      476
9    Exception in receiveBlock for block                                           89       89
10   PendingReplicationMonitor timed out                                           45       45
11   Other anomalies                                                              108      107
     Total (out of 550,000 blocks)                                              16916    16808

Seq  Description                                      False Positives
1    Normal background migration                                 1397
2    Multiple replica (for task / jobdesc files)                   349
     Total                                                        1746
Making results easy for operators to understand!
42
Explaining detection results with
decision tree
[Decision tree: internal nodes test counts of message types such as “writeBlock # received exception”, “# Starting thread to transfer block # to #”, “#: Got exception while serving # to #:#”, “addStoredBlock request received for # on # size # But it does not belong to any file”, “#Verification succeeded for #”, “Receiving block # src: # dest: #”, and “Unexpected error trying to delete block #. BlockInfo Not found in volumeMap”; leaves label a block’s vector as normal (0) or abnormal (1); see the sketch below]
43
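A sketch of this last step: fit a shallow decision tree on the count vectors, with the PCA anomaly labels as the target, so operators get readable rules instead of raw PCA scores. This uses scikit-learn on toy data; the feature names and counts are illustrative, not the real HDFS features.

from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

# toy message count vectors and PCA-derived labels (1 = abnormal)
feature_names = ["writeBlock exception", "starting thread to transfer",
                 "Receiving block", "Received block"]
X = np.array([
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [1, 3, 2, 1],   # abnormal
    [2, 0, 1, 0],   # abnormal
])
y = np.array([0, 0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=feature_names))   # operator-friendly rules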
Summary: Mining Console Logs for
Large-Scale System Problem Detection
[Pipeline: Extract (200 nodes, >24 million lines of logs) → Detect abnormal log segments → Visualize (a single decision tree to visualize system behavior)]
Contributions: how to parse logs, how to turn them into features ready for SML,
and how to make results accessible to operators; which SML algorithm is used matters less
44
Outline
• Cloud Computing
• RAD Lab
• Case Study 1: Director & SCADS
– Cloud storage and automatic management
• Case Study 2: Automatic Log Analysis
– Find anomalous behavior
• Case Study 3: Predicting Map Reduce
jobs
– Help with scheduling
• Summary
45
Using SML to Characterize,
Predict, Optimize Resources
Given a system (e.g. Web Service, multicore …),
its workload, and performance metrics:
• Answer “what-if” questions for
– Changes in workload (mix/rate/distribution)
– Changes in system configuration (hw/sw)
– Changes in both
• 3 Examples for Case Study 3:
– Parallel Database Performance Prediction
– Hadoop Job Performance Prediction
– Multi-core Performance Optimization (if time permits)
Problem Statement
[Diagram: Workload → SYSTEM → Behavior, captured by a Model]
• Find relationships between workload and performance
• How to generate the model?
– Manually constructed queueing model
– Automatically constructed
• Regression, genetic algorithms, correlation analysis,…
– This talk: Canonical Correlation Analysis
Finding Correlations
Kernel Canonical Correlation
Analysis (KCCA)
• 2003, Bach & Jordan at Berkeley
• Find dimensions of maximal correlation
between kernelized datasets
– “Kernel functions” impose system-specific definition
of similarity
• Dimensionality reduction + clustering effect
Advantages:
– Relative distance between datapoints (defines
neighborhoods)
– Interpolation (preserves neighborhoods across both
datasets)
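A minimal sketch of regularized kernel CCA in the generic textbook formulation (Gaussian kernels on both datasets, then a generalized eigenproblem over the kernel matrices). It is not the RAD Lab implementation: kernel centering is omitted, and the regularization constant, kernel width, and toy data are arbitrary assumptions.

import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, sigma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / sigma)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / sigma)

def kcca(Kx, Ky, reg=1e-3, n_components=2):
    """Find weights A, B so that the projections Kx @ A and Ky @ B are
    maximally correlated, via the block generalized eigenproblem
        [0      Kx Ky] [a]       [(Kx+reg I)^2        0      ] [a]
        [Ky Kx  0    ] [b] = rho [0             (Ky+reg I)^2 ] [b]
    """
    n = Kx.shape[0]
    I = np.eye(n)
    Z = np.zeros((n, n))
    LH = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    Rx = (Kx + reg * I) @ (Kx + reg * I)
    Ry = (Ky + reg * I) @ (Ky + reg * I)
    RH = np.block([[Rx, Z], [Z, Ry]])
    vals, vecs = eigh(LH, RH)                 # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_components]     # most correlated directions first
    return top[:n], top[n:]                   # A, B

# Toy usage: two noisy views of the same 40 points.
rng = np.random.default_rng(0)
pts = rng.normal(size=(40, 3))
Kx = rbf_kernel(pts, sigma=3.0)
Ky = rbf_kernel(pts + 0.1 * rng.normal(size=pts.shape), sigma=3.0)
A, B = kcca(Kx, Ky)
proj_x, proj_y = Kx @ A, Ky @ B               # KCCA projections of each point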
3 Challenges in using SML
1. Mapping function from traces to feature vectors
• Workload -> multidimensional space
• System behavior -> multidimensional space
2. How to compare similarity of two feature
vectors
3. How to leverage results of SML
techniques to find actionable answer
Hadoop Job Performance Prediction
[Diagram: Jobs → Hadoop → Performance]
51
Motivating Scenarios
• Which jobs should I schedule together to avoid
resource contention in Hadoop?
– Should I run my job now or wait till later?
• What is the optimal number of nodes to run a job
on?
– How many nodes should my job run on to meet the
deadline?
• Given observed behavior of a job run at small scale,
how will the job behave when scaled up?
– Are the performance bottlenecks the same?
– Which aspects of the system perform non-linearly?
52
Hadoop Facts
• Map-reduce jobs have diverse
performance profiles
- Each job has a different behavioral signature
- Input parameters affect scaling slopes of jobs
• Map performance is linear wrt number of
nodes
• Reduce complexity is job specific
• Shuffle step is primary source of job
unpredictability
53
SML Goals
• Predict runtime/resource requirements for jobs
– Same jobs, different configs
– Same configs, different jobs
• Identify which jobs can be co-scheduled
– minimize resource contention
– maximize shared data
• Create Map-Reduce benchmark
– Representative workloads
– Varying scale factors
54
Data and Testbed:
Facebook
• Multi-user
environment
• Data set spans 6
months
• Know job names
but not job source
code
• ~600 node cluster
• Nodes 1-300
– 16GB memory
– 5 map slots
– 5 reduce slots
• Nodes 301-600
– 8GB memory
– 5 map slots
– 0 reduce slots
Recurring Extract Transform
Load (ETL) Jobs
• 66 unique jobs account for 99% of ETL jobs
[Plot: # of instances per JobID]
• Most repeated job: same job, many configs
[Plot: Total Time (0–14000) vs. Number of Maps (0–250)]
X (Job Descriptor Vector) → Hadoop → Y (Performance Vector)
• Job Descriptor Vector X: job config parameters (Number of Maps, Number of Reduces) and data characteristics (Map Input Bytes, Locality)
• Performance Vector Y: Map Time, Reduce Time, Total Time, Map Output Bytes
Predicting Job Performance
[Scatter plots: predicted vs. actual Total Time (ms), and predicted vs. actual Map Output Bytes]
Multi-Core Performance
Optimization (if time permits)
Application
Optimization
Configurations
Multi-core
Performance
A. Ganapathi, K. Datta, A. Fox, D. Patterson, "A Case for Machine
Learning to Optimize Multicore Performance", First USENIX
Workshop on Hot Topics in Parallelism (HotPar '09), Berkeley, CA,
March 30-31, 2009.
Problem
Diverse motifs (structured grids, dense linear algebra, sparse linear algebra)
+ diverse multicore architectures (Sun Niagara2, Intel/AMD x86, IBM Blue Gene)
+ compilers alone (xlc, icc, gcc; no code tuning)
= Poor Performance!
Solution: Auto-tuning
Identify motif-specific optimizations → generate code variants based on
these optimizations → search parameter space for best configuration
Benefits:
• Portable across diverse architectures
• Easily scalable
• Tunable for any relevant metric
• Minimizes programmer tuning time
• Proven track record (e.g. ATLAS, SPIRAL, FFTW, OSKI)
• Already showed speedups of 1.5x to 5.6x
Problem: searching the parameter space
Optimization            Parameters   Total Configs
Thread Count                 1              4
Domain Decomposition         4             36
Software Prefetching         2             18
Padding                      1             32
Inner Loop                   8            480
Total                       16           4x10^7
ML Goals
Goal: find best value for configuration parameters to
simultaneously optimize multiple performance metrics
• Require little architecture/application knowledge
• Provide actionable suggestions for improving performance
• Simultaneously optimize for multiple performance metrics (e.g. energy efficiency and cycle time)
• Use small set of samples to build model
Experimental Setup
• 256^3 regular grid; for a point (x,y,z) the stencil touches its nearest neighbors x±1, y±1, z±1
• Structured Grids (stencil codes)
– Used in iterative PDE solvers for image processing, diffusion, etc.
– For a given point, a stencil is a fixed subset of nearest neighbors
– A stencil code updates every point in a regular grid by “applying a stencil”
– Stencil codes are usually bandwidth-bound
• Kernels: 3D 7-point stencil and 3D 27-point stencil (see the sketch below)
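For concreteness, a naive (unoptimized) 3D 7-point stencil sweep in Python/NumPy; the grid size, coefficients, and single-sweep structure are arbitrary assumptions, and this is the kind of baseline the auto-tuned variants would improve on.

import numpy as np

def stencil_7pt(A):
    """One Jacobi-style sweep: each interior point is updated from itself
    and its six nearest neighbors (x±1, y±1, z±1)."""
    B = A.copy()
    B[1:-1, 1:-1, 1:-1] = (
        A[1:-1, 1:-1, 1:-1]
        + A[2:, 1:-1, 1:-1] + A[:-2, 1:-1, 1:-1]
        + A[1:-1, 2:, 1:-1] + A[1:-1, :-2, 1:-1]
        + A[1:-1, 1:-1, 2:] + A[1:-1, 1:-1, :-2]
    ) / 7.0
    return B

A = np.random.rand(64, 64, 64)   # smaller than the 256^3 grid in the experiments
A = stencil_7pt(A)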
Architectures
• AMD Barcelona
2 sockets x 4 cores/socket, 2.30 GHz, compiler: gcc
• Intel Clovertown
2 sockets x 4 cores/socket, 2.66 GHz, compiler: icc
Potential Optimizations
TY
+X
NX(unit stride)
CY
Thread 0
CZ
RXRY
RZ
CX
Domain Decomposition
Processing
…
A[i-1]
A[i]
A[i+1]
Low Memory
Bandwidth
Retrieving
from DRAM
…
A[i+dist-1]
A[i+dist]
A[i+dist+1]
Thread 1
Conflict
Misses
…
NY
TY
CZ
NZ
+Z
+Y
TX
Parallelization
and
Capacity Misses
TX
Software Prefetching
Thread n
Array Padding
for i=1 to x {
for i=1
to xto
{ y{
for j=1
for j=1
to
{ z{
for i=1
to
{ y to
forxk=1
for k=1
for j=1
to
y to
{ z{
A[i,j,k]=…
A[i,j,k]=…
for k=1
} to z {
}}
A[i,j,k]=…
}}}
}}
}
Poor Functional
Unit Usage and
Scheduling
Inner Loop Opts.
3 Challenges in using ML
1. Mapping function from traces to feature vectors
• Configuration parameters -> multidimensional space
• System behavior -> multidimensional space
2. How to compare similarity of two feature
vectors
3. How to leverage results of ML
techniques to find actionable answer
Configuration Parameters
Optimization Type          Parameter Name                    Parameter Range                    # Configurations
Thread Count               Number of Threads (NThreads)      {2^0…2^3}                          4
Domain Decomposition       Block Size CX                     {2^7…NX}                           36 (combined)
                           Block Size CY                     {2^1…NY}
                           Block Size CZ                     NZ
                           Chunk Size                        (range not shown)
Software Prefetching       Prefetching Type                  {register block, plane, pencil}    3
                           Prefetching Distance              {0, 2^5…2^9}                       6
Padding                    Padding Size                      {0…31}                             32
Inner Loop Optimizations   Register Block Size RX            {2^0…2^1}                          10 (combined)
                           Register Block Size RY            {2^0…2^1}
                           Register Block Size RZ            {2^0…2^3}
                           Statement Type                    {complete, individual}             2
                           Read From Type                    {array, variable}                  2
                           Pointer Type                      {fixed, moving}                    2
                           Neighbor Index Type               {register block, plane, pencil}    3
                           FMA-like Instructions             {yes, no}                          2

Feature Vector X (example):
Threads   Block Size CX   Block Size CY   Block Size CZ   Padding Size   Prefetch type   Prefetch distance   Statement Type
4         32              128             256             32             Plane           64                  individual
Performance Metrics
Counter          Description
PAPI_TOT_CYC     Cycles per thread per job
PAPI_L1_DCM      L1 data cache misses per thread
PAPI_L2_DCM      L2 data cache misses per thread
PAPI_TLB_DCM     TLB misses per thread
PAPI_CA_SHR      Accesses to shared cache lines
PAPI_CA_CLN      Accesses to clean cache lines
PAPI_CA_ITV      Cache interventions
Power meter      Watts consumed per second

Energy efficiency (derived) = (total cycles x # of flops) / (clk rate x # of watts)

Feature Vector Y (example):
Total Cycles   L1_DCM   L2_DCM   TLB_DCM   CA_SHR   CA_CLN   CA_ITV   Energy Efficiency
1.9E7          2.4E5    1.5E5    1.2E4     1.2E5    1.4E4    1.2E3    2.3E4
Kernel functions
Vector   Threads   Block Size CX   Block Size CY   Block Size CZ   Padding Size   Prefetch type   Prefetch distance   Statement Type
X1       4         32              128             256             32             Plane           64                  Individual
X2       2         64              128             256             32             Plane           64                  Individual
X3       2         64              128             256             32             Pencil          64                  complete
Are X1 and X2 more similar than X2 and X3?
Measure of similarity
Numeric Columns: Gaussian Kernel
K(xi, xj) = exp(-||xi - xj||^2 / σ)
Non-numeric Columns:
K(xi, xj) = {1 if xi = xj, 0 if xi ≠ xj}
Similarity(X1, X2) = average(K(X1i, X2i))
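A sketch of this mixed similarity measure: a Gaussian kernel on numeric columns, an exact-match (delta) kernel on categorical ones, averaged across columns. The scale parameter σ is an arbitrary assumption here; in practice it would be tuned per column.

import math

def column_kernel(a, b, sigma=1.0):
    # numeric columns: Gaussian kernel; non-numeric columns: delta kernel
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.exp(-((a - b) ** 2) / sigma)
    return 1.0 if a == b else 0.0

def similarity(x1, x2, sigma=1.0):
    return sum(column_kernel(a, b, sigma) for a, b in zip(x1, x2)) / len(x1)

X1 = [4, 32, 128, 256, 32, "Plane",  64, "individual"]
X2 = [2, 64, 128, 256, 32, "Plane",  64, "individual"]
X3 = [2, 64, 128, 256, 32, "Pencil", 64, "complete"]

print(similarity(X1, X2), similarity(X2, X3))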
Stencil Code
[Diagram: each of N stencil-code runs on a platform yields a Configuration Feature Vector X (compiler optimizations, parameters for memory blocking, software prefetching) and a Performance Feature Vector Y (performance counters, power meter readings); kernel functions turn the N vectors of each view into N x N kernel matrices KX and KY]
Kernel functions
• Numeric Columns: Gaussian Kernel
• Non-numeric Columns: K(xi, xj) = {1 if xi = xj, 0 if xi ≠ xj}
• Similarity(X1, X2) = average(K(X1i, X2i))
KCCA solves the generalized eigenproblem

  [ 0       KX KY ] [A]     [ KX KX    0     ] [A]
  [ KY KX   0     ] [B] = ρ [ 0        KY KY ] [B]

and projects the data onto the maximally correlated dimensions KX*A and KY*B
Finding Optimal Configurations
• Raw data space: configuration features X and performance metrics Y
• Project into the KCCA data space as KX*A and KY*B
• Find nearest neighbors of the desired point in the projected space
• Map the neighbors back to raw configurations (the inverse image problem; see the sketch below)
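A sketch of the lookup step just described: project a candidate configuration into the KCCA space, find its nearest training neighbors there, and return their raw configurations. It reuses the A matrix from the earlier KCCA sketch, and reduces the inverse-image problem to "reuse a neighbor's configuration"; all names are illustrative.

import numpy as np

def best_configs_for(new_kernel_row, Kx_train_proj, raw_configs, A, k=3):
    """new_kernel_row  : kernel values between the new config and each training config, shape (N,)
    Kx_train_proj      : projections of the training configs, Kx @ A, shape (N, d)
    raw_configs        : list of the N raw training configurations
    A                  : KCCA projection weights for the configuration view
    Returns the raw configurations of the k nearest neighbors in KCCA space."""
    new_proj = new_kernel_row @ A                        # project the new point
    dists = np.linalg.norm(Kx_train_proj - new_proj, axis=1)
    nearest = np.argsort(dists)[:k]                      # indices of closest training runs
    return [raw_configs[i] for i in nearest]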
Results
75
Auto-tuning Time
• Exhaustive Search:
– 4x107 configs x 0.08 s/trial x 5 trials > 180 days
• Our Technique:
– Training data: 1500 configs x 0.08 s/trial x 5 trials = 10 minutes
– Training time using KCCA: 90 minutes
– Genetic algorithm: 243 configs x 0.08 s/trial x 5 trials = 1.6 min
– Total Time < 2 hrs
76
Summary: SML Strengths
1. Train vs. write the logic (handle SW churn)
2. Learns how to handle and make policy
during transitions between steady states
(unlike queueing theory)
3. Coping with a complex cost function
(unlike simple control theory)
4. Finding trends, needles in data haystack
5. Fast enough to run online (today)
– Infeasible algorithms from 1960s fast in 2008
– Computers cheap enough OK to “just” monitor
77
Summary: SML
Accommodations
• System data usually text vs. numbers
– Need to turn text into features
– Need to find distance function for close vs. far
• SML is not perfect
– System must live with <100% accuracy, false
positives
• Will be asked “Yes, I see that SML works,
but why didn’t you try ‘simpler’ Algorithm X?”
– For all X
– Simpler = I know X, I don’t know SML?
78
Conclusion
• Cloud Computing will transform IT industry
– Pay-as-you-go utility computing leveraging
economies of scale of Cloud provider
– Anyone can create/scale next eBay, Twitter…
• Cloud Computing ⇒ better management:
cost-performance, scale, scale up/down
• SML: making sense from lots of data
• RAD Lab applying SML to new Cloud
Computing challenges: Director, SCADS,
Map Reduce Scheduling, Log Analysis
79
Acknowledgments
• Peter Bodik for Director/SCADS experiments
• Wei Xu for Log Analysis
• Archana Ganapathi for Map Reduce and
Multicore Performance Prediction
• 30 PhD students and 8 faculty of RAD Lab
• RAD Lab founding sponsors Google,
Microsoft, and Sun Microsystems and
affiliates Amazon Web Services, Cloudera,
Cisco, Facebook, HP, NetApp, SAP, VMware
80
Backup Slides
81
Evaluation Benchmarks
• “No Optimization”: running naïve code
• “Expert Optimized”: ordered the optimizations (Opt. #1 Parameters, Opt. #2 Parameters) and applied them consecutively; more information: Datta et al., “Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures”, Supercomputing 2008
• “Random Raw Data”: best performing point in raw data
• “Genetic on Raw Data”: permute configs for top three best performing points in raw data
Cross Validation
Prediction Accuracy: 1 – [Σ(predicted_i – actual_i)^2 / Σ(actual_i – actual_mean)^2]
Test Set      Map Time   Reduce Time   Total Time   Map Output Bytes
mod4 (old)      0.60        0.80          0.70           0.50
mod3            0.84        0.85          0.86           0.94
mod2            0.81        0.76          0.83           0.94
mod1            0.91        0.77          0.88           0.97
mod0 (new)      0.96        0.89          0.97           0.91
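The prediction-accuracy metric above is the coefficient of determination (R^2); a one-line Python version, shown for completeness with made-up numbers:

import numpy as np

def prediction_accuracy(predicted, actual):
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    ss_res = np.sum((predicted - actual) ** 2)          # Σ(predicted_i – actual_i)^2
    ss_tot = np.sum((actual - actual.mean()) ** 2)      # Σ(actual_i – actual_mean)^2
    return 1.0 - ss_res / ss_tot

print(prediction_accuracy([2.0, 3.0, 4.1], [2.0, 3.0, 4.0]))  # close to 1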
Train with Sampled Subset
Prediction Accuracy: 1 – [Σ(predicted_i – actual_i)^2 / Σ(actual_i – actual_mean)^2]
Training Set   Map Time   Reduce Time   Total Time   Map Output Bytes
All old data     0.96        0.89          0.97           0.91
sample1          0.93        0.87          0.94           0.85
sample2          0.62        0.94          0.81           0.81
sample3          0.76        0.91          0.91           0.72
sample4          0.64        0.83          0.78           0.71
sample5          0.69        0.87          0.82           0.74
Microsoft’s Chicago
Modular Datacenter
85
The Million Server
Datacenter
• 24000 sq. m housing 400 containers
– Each container contains 2500 servers
– Integrated computing, networking, power,
cooling systems
• 300 MW supplied from two power
substations situated on opposite sides of
the datacenter
• Dual water-based cooling systems
circulate cold water to containers,
eliminating need for air-conditioned rooms
86
[SCADS performance model plots (backup): high workload / low put rate; high workload / high put rate; low workload / low put rate]
2020 IT Carbon Footprint
• 2007 worldwide IT carbon footprint: 2% = 830 m tons CO2, comparable to the global aviation industry
• Expected to grow to 4% by 2020
[Chart values: 820m, 360m, 260m tons CO2]
90
Energy & Cloud Computing?
• Cloud Computing saves Energy?
• Don’t buy machines for local use that are
often idle
• Better to ship bits as photons over fiber vs.
ship electrons over transmission lines to
spin disks, power processors locally
– Clouds use nearby (hydroelectric) power
– Leverage economies of scale of cooling, power
distribution
91
Energy & Cloud Computing?
• Techniques developed to stop using idle
servers to save money in Cloud Computing
can also be used to save power
– Up to Cloud Computing Provider to decide
what to do with idle resources
• New Requirement: Scale DOWN and up
– Who decides when to scale down in a
datacenter?
– How can Datacenter storage systems improve
energy?
92
Hybrid / Surge Computing
• Keep a local “private cloud” running same
protocols as public cloud
• When need more, “surge” onto public
cloud, and scale back when need fulfilled
• Saves energy (and capital expenditures)
by not buying and deploying power
distribution, cooling, machines that are
mostly idle
93