Python In The Cloud
PyHou MeetUp, Dec 17th 2013
Chris McCafferty, SunGard Consulting Services
Overview
• What is the Cloud?
• What is Big Data?
• Big Data Sources
• Python and Amazon Web Services
• Python and Hadoop
• Other Pythonic Cloud providers
• Wrap-up
What Is The Cloud
• I want 40 servers and I want them NOW
• I want to store 100 TB of data cheaply and reliably
• We can do this with Cloud technologies
What is Big Data
• “Three Vs”
– Volume
– Variety
– Velocity
• Genome: sequencing machines throw off
several TB per day. Each.
• Hard drive performance is often the killer
bottleneck, both reading and writing
What is NOT Big Data
• Anything where the whole data set can be
held in memory on a single standard instance
• Data that can be held straightforwardly in a
traditional relational database
• Problems where most of the data can be
trivially excluded
• There are many challenging problems in the
world – but not all need Cloud or Big Data
tools to solve them
To The Cloud!
• Amazon Web Services is the 800lb gorilla in this
space
– Start here if in doubt
• Other options are RackSpace, Microsoft Azure,
(PiCloud/Multyvac?)
• You can also spin up some big iron very cheaply
– Current AWS big memory spec is cr1.8xlarge
– 244GB RAM, 32 Xeon-E5 cores, 10 Gigabit network
– $3.50 per hour
Geo Big Data Sources
• NASA SRTM data is on the large side
• NASA recently released a huge set of data directly into
the cloud: NEX
– Earth Sciences data sets
• Made available on Amazon Web Services public
datasets
• Available on S3 at:
– s3://nasanex/NEX-DCP30
– s3://nasanex/MODIS
– s3://nasanex/Landsat
• There are many, many geo data sets available now
(NOAA Lidar, etc)
Time for some code
• Example - Use S3 browser to look at new
NASA NEX data
• Let’s download some with boto package
• Quickest to do this from an Amazon data
centre
• See DemoDownloadNasaNEX.py
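DemoDownloadNasaNEX.py itself isn't reproduced in this transcript; the snippet below is a minimal sketch of that kind of download using boto 2. The NEX-DCP30 prefix comes from the earlier slide, while the anonymous connection, the local filename and grabbing only the first object are illustrative choices.

```python
# Hypothetical stand-in for DemoDownloadNasaNEX.py: list the public nasanex
# bucket anonymously with boto 2 and download the first object under a prefix.
import boto

conn = boto.connect_s3(anon=True)                          # public bucket, no credentials needed
bucket = conn.get_bucket('nasanex', validate=False)

for key in bucket.list(prefix='NEX-DCP30/'):
    print('%s (%d bytes)' % (key.name, key.size))
    key.get_contents_to_filename(key.name.split('/')[-1])  # save alongside the script
    break                                                  # just the first file for the demo
```

Running this from an EC2 instance in the same region keeps the transfer inside Amazon's network, which is why the slide suggests doing it from an Amazon data centre.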
Weather & Big Data Sources
• Good public weather and energy data
• It's hard to move data around for free: just try!
• Power grids shed many GB of public data a day
– Historical data sets run to many terabytes
• Weather data available from NOAA
– QCLCD: hourly, daily, and monthly summaries for approximately 1,600 U.S. locations.
– ASOS data contains sensor data at one-minute
intervals. 5 min intervals available too.
• 900 stations, 3-4MB per day, 12 years of data = 11-15TB data
set.
Why go to the cloud
• Cheap - see AWS pricing here
– spot pricing of m1.medium normally ~1c/hr
• The cloud is increasingly where the (public) data
will reside
• Pay as you go, less bureaucracy
• Support for Big Data technologies out of the box
– Amazon Elastic MapReduce (EMR) gives you a Hadoop cluster on EC2 with minimal setup
• Host a big web server farm or video streaming
cluster
Python on AWS EC2
• AWS = Amazon Web Services. The Big Cloud
• EC2 = Elastic Compute Cloud
• Let’s run up an instance and see what we have
available
• See this script as one way to upgrade to Python
2.7
• Note absence of high-level packages like NumPy,
matplotlib and Pandas
• It would be very useful to have a very high-level
Python environment…
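The upgrade script isn't included in the transcript, but "running up an instance" from Python can be sketched with boto 2 as below. The AMI id, key pair and security group are placeholders rather than values from the talk.

```python
# A rough sketch (not the presenter's script) of launching an EC2 instance
# with boto 2. The AMI id, key pair and security group names are placeholders.
import time

import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')    # credentials from ~/.boto or env vars

reservation = conn.run_instances(
    'ami-xxxxxxxx',                               # placeholder Amazon Linux AMI id
    instance_type='m1.medium',
    key_name='my-keypair',
    security_groups=['default'])

instance = reservation.instances[0]
while instance.state != 'running':                # poll until the box is up
    time.sleep(5)
    instance.update()

print('Instance %s running at %s' % (instance.id, instance.public_dns_name))
```

Once you SSH in, a quick `python -V` shows how bare a stock image is, which is where StarCluster's pre-built AMIs come in.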
StarCluster
• Cluster management in AWS, written by a group at MIT
• Convenient package to spin up clusters (Hadoop or other)
and copy across files
• Machine images (AMIs) for high-level Python environments
(NumPy, matplotlib, Pandas, etc)
• Not every high-level library is there
– No sklearn (SciKit-Learn, machine learning)
– But easier to pip-install with most pre-requisites already there
• Sun Grid Engine: Job Management
• Hadoop
• Boto plugin
• dumbo… and much more
Python's Support for AWS
• boto - interface to AWS (Amazon Web Services)
• Hadoop Streaming - use Python in MapReduce
tasks
• mrjob - Framework that wraps Hadoop Streaming and uses boto (minimal example after this list)
• pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce
• Write Python in User-Defined Functions in Pig,
Hive
– Essentially wraps MapReduce and Hadoop Streaming
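None of these frameworks is shown in the slides, so here is a minimal mrjob job as an illustration of the model: mrjob wraps Hadoop Streaming, so the mapper and reducer are ordinary Python methods. Word count is just a stand-in task.

```python
# Minimal mrjob example (word count). mrjob turns this class into a
# Hadoop Streaming job; each method runs as a separate streaming task.
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # each input line arrives via stdin from Hadoop Streaming
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```

The same file runs locally (`python wordcount.py input.txt`), on your own cluster (`-r hadoop`) or on Amazon Elastic MapReduce (`-r emr`) without code changes.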
Boto - Python Interface to AWS
• Support for HDFS
• Upload/download from Amazon S3 and Glacier
• Start/stop EC2 instances
• Manage users through IAM
• Virtually every API available from AWS is supported
• django-storages uses boto to present an S3
storage option
• See http://docs.pythonboto.org/en/latest/
• Make sure you keep your AWS key-pair secure
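As a flavour of the API, a couple of the operations listed above look roughly like this in boto 2 (assuming credentials are already configured in ~/.boto or the environment):

```python
# Illustrative boto 2 calls: inspect (and optionally stop) EC2 instances,
# then list IAM users. Not from the talk; just sample usage of the API.
import boto
import boto.ec2

ec2 = boto.ec2.connect_to_region('us-east-1')
for reservation in ec2.get_all_instances():        # instances come back grouped by reservation
    for instance in reservation.instances:
        print('%s %s' % (instance.id, instance.state))
        # ec2.stop_instances(instance_ids=[instance.id])   # uncomment to stop it

iam = boto.connect_iam()
print(iam.get_all_users())                         # raw response listing IAM users
```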
Another Code Example – upload
• Example where we merge many files together
and upload to S3
• Merge files to avoid the Small Files Problem
• Note use of retry decorator (exponential
backoff)
• See CopyToCloud.py and
MergeAndUploadTxOutages.py
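Those two scripts aren't reproduced here; the sketch below only illustrates the pattern they describe: concatenate many small files into one object and upload it to S3 behind an exponential-backoff retry decorator. The file paths, bucket name and retry settings are made up for illustration.

```python
# Merge many small text files into one and upload to S3 with retries.
# Paths and bucket name are placeholders; the retry decorator backs off
# exponentially between attempts, as mentioned on the slide.
import glob
import time
from functools import wraps

import boto


def retry(tries=5, delay=1, backoff=2):
    """Retry the wrapped call, doubling the wait after each failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


@retry()
def upload(bucket, key_name, filename):
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(filename)


# merge many small daily files to avoid the Small Files Problem
with open('merged.txt', 'w') as out:
    for path in sorted(glob.glob('outages/*.txt')):
        with open(path) as f:
            out.write(f.read())

conn = boto.connect_s3()
bucket = conn.get_bucket('my-outage-data')          # placeholder bucket name
upload(bucket, 'merged/tx_outages.txt', 'merged.txt')
```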
What is Hadoop?
• A scalable data and job manager suitable for
MapReduce jobs
• Core technologies date from early 2000s at
Google
• Retries failed tasks, redundant data, good for
commodity hardware
• Rich ecosystem of tools including NoSQL
databases, good Python support
• Example: let's spin up a cluster of 30 machines with StarCluster
Hadoop Scales Massively
Hadoop Streaming
• Hadoop passes incoming data in rows on stdin
• Any program (including Python) can process
the rows and emit to stdout
• Logging and errors go to stderr
Hadoop Streaming - Echo
• Useful example that can be used for
debugging
• Tells you what Hadoop is actually passing your
task
• See echo.py
• Similar example firstten.py peeks at the first
ten lines then stops
• Useful for debugging
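Neither script is reproduced in the transcript, but a plausible minimal echo.py is just a stdin-to-stdout pass-through with diagnostics on stderr; firstten.py would be the same loop breaking after ten lines.

```python
# A plausible minimal echo.py: emit every row Hadoop hands us unchanged,
# keeping diagnostics on stderr so they don't pollute the task output.
import sys

count = 0
for line in sys.stdin:
    sys.stdout.write(line)    # pass each row straight through
    count += 1

sys.stderr.write('echoed %d lines\n' % count)
```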
Hadoop Parsing Example
• Python's regex support makes it very good for parsing
unstructured data
• One of the keys in working with Hadoop and Big Data is
getting it into a clean row-based format
• Apply 'schema on read'
• Transmission Data from PJM is updated here every 5
mins: https://edart.pjm.com/reports/linesout.txt
• Needs cleaning up before we can use it for detailed analysis
- note multi-line format
• Script split_transmission.py
• Watch out for Hadoop splitting the input into blocks in the middle of a multi-line record
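split_transmission.py isn't included and the exact linesout.txt layout isn't shown, so the sketch below only illustrates the general "schema on read" idea: detect the start of each multi-line record with a regex (the date-at-start-of-line pattern is an assumption) and emit one clean tab-separated row per record.

```python
# Illustrative multi-line record splitter in the spirit of split_transmission.py.
# The record-start pattern is an assumption, not the real PJM format.
import re
import sys

RECORD_START = re.compile(r'^\d{2}/\d{2}/\d{4}')    # assume each record begins with a date

record = []
for raw in sys.stdin:
    line = raw.strip()
    if RECORD_START.match(line) and record:
        print('\t'.join(record))                    # flush the previous record as one row
        record = []
    if line:
        record.append(line)

if record:                                          # don't forget the final record
    print('\t'.join(record))
```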
Alternatives to AWS
• PiCloud offers open source software enabling you to run large computational clusters
– Just acquired by Dropbox
– Pay for what you use: 1 core and 300 MB of RAM costs $0.05/hr
– Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud
• Disco is MapReduce implemented in Python (sketch after this list)
– Started life at Nokia
– Has its own Distributed Filesystem (like HDFS)
• Or roll your own cluster in-house with pp (parallel python)
• StarCluster and Sun Grid Engine on another vendor or in-house
• Google App Engine…?
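As a taste of Disco (referenced above), here is a job close to the project's canonical word-count example; the input URL is just a sample text file and a running Disco cluster is assumed.

```python
# Disco word count, adapted from the project's standard example.
# Both the job definition and the map/reduce functions are plain Python.
from disco.core import Job, result_iterator


def fun_map(line, params):
    for word in line.split():
        yield word, 1


def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)


if __name__ == '__main__':
    job = Job().run(input=['http://discoproject.org/media/text/chekhov.txt'],
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print('%s %d' % (word, count))
```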
PiCloud
• Acquired by Dropbox in Nov 2013
• Dropbox will probably come out with its own cloud compute offering in 2014
• As of Dec 2013, no new sign-ups
• Existing customers encouraged to migrate
to Multyvac
• PiCloud will be switched off on Feb 25th, 2014
• The underlying PiCloud software is still open
source
Conclusions
• For cheap compute power and cheap storage, look to
the cloud
• Python is well-supported in this space
• Consider being close to your data: in the same cloud
– Moving data is expensive and slow
• Leverage AWS with tools like boto, StarCluster
• Beware setting up complex environments: installing
packages takes time and effort
• Ideally, think Pythonically – use the best tools to get the job done
Links
• Good rundown on the Python ecosystem around
Hadoop from Jan 2013:
– http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
• Early vision for PiCloud (YouTube Mar 2012)
– http://www.youtube.com/watch?v=47NSfuuuMfs
• Disco MapReduce Framework from PyData
– http://www.youtube.com/watch?v=YuLBsdvCDo8
– PuTTY tool for Windows
• Some AWS & Python war stories:
– http://nz.pycon.org/schedule/presentation/12
Thank you
• Chris McCafferty
• http://christophermccafferty.com/blog
• Slides will be at:
• http://christophermccafferty.com/slides
• Contact me at:
• [email protected]